Methods of Correlation
and Regression Analysis

Linear and Curvilinear

THIRD EDITION


Mordecai Ezekiel
Head, Economics Department
Food and Agriculture Organization
of the United Nations

Karl A. Fox
Head, Department of Economics
and Sociology,
Iowa State University


New York • John Wiley & Sons, Inc.
London • Sydney






Copyright, 1930, 1941 by Mordecai Ezekiel
under the title Methods of Correlation Analysis


Copyright © 1959 by John Wiley & Sons, Inc.


All Rights Reserved. This book or any part
thereof must not be reproduced in any form
without the written permission of the publisher.


Library of Congress Catalog Card Number: 59-117^


Printed in the United States of America



Preface to Third Edition 


Thirty years have elapsed since the original edition of this book
was written — years of political tensions and upheaval and of enormous 
progress in technical development. This last has been reflected in
changes in some of the examples cited — from an automobile with 
two-wheel brakes in the 1920’s to the orbit of an earth satellite in the 
late 1950’s, and from methods for using hand calculators and card 
tabulators to those for electronic computers. Despite this technical 
progress, the basic elements of correlation analysis continue unchanged. 
The major emphasis, however, has shifted from correlation to regres- 
sion, and the wide range of uses of the method in varied fields has led 
to many specialized applications or modifications. This is especially 
true in econometrics. Here the long controversy over mutually inter- 
correlated variables has finally produced an effective simultaneous- 
equation method for dealing with situations where single-equation 
solutions are inadequate; but apparently such situations are relatively 
infrequent. 

In this third edition the senior author has been fortunate in securing 
the collaboration of an associate who has made distinguished contribu- 
tions in these newer aspects of the field, particularly in their applica- 
tion to problems of actual research. The new chapter on simultaneous- 
equation solutions, Chapter 24, is one of his contributions. He is also
responsible for the extended treatment of the analysis of variance in 
relation to regression problems (Chapter 23), the modernization of 
the chapter on standard errors in multiple regression (Chapter 17), 
and a complete revision of the treatment of error formulas for time
series (Chapter 20), as well as for many other contributions throughout
the text.

In the revision, methods of determining regressions by algebraic
equations have been given first consideration, and graphic approximation
methods have been treated second, with due consideration of the
limitations of each. Modern terminology, especially in the econometric
field, has been recognized and used where appropriate, and the
presentation of sampling and of confidence intervals has been
modernized.

The order of presentation has been rearranged and grouped under
seven major sections. We hope this will make the general development
clearer both to students and to teachers. Modern methods of
calculating correlation and regression constants are outlined in a new
chapter, and methods of using electronic calculators for this purpose
are briefly treated. The chapter on uses of correlation analysis has
been recast and extended to cover newer fields in which the method is
now widely used, as well as recent work and developments in fields
long familiar. Examples from other countries, and also from the
international field in which the senior author has been working this
past decade, have been introduced here and elsewhere in the book.
The treatment of standard errors and their meaning for statistics
derived from small samples has been materially revised. Introduction
of adjustments for correlation coefficients and indexes to remove bias
due to limited degrees of freedom has been deferred until a later chapter
in the book (Chapter 17).

Despite these innovations, the general simplicity of expression and
explanation has been retained as far as possible. Mathematical
derivations have been relegated to the technical appendix, and many
of the more obvious ones have been eliminated; the notation has been
kept as simple as possible, and only a modest level of mathematical
training is assumed for the reader.

As in previous editions, full attention is given to multiple curvilinear
and joint functional regressions, essential for adequate treatment of
many problems in both natural and social sciences. Many standard
statistical texts still largely ignore the use of non-linear regressions as
practical working tools. In recognition of the increased emphasis
given to regression analysis in general, and the unusually full treatment
of non-linear regression, the title of the book has been changed
to Methods of Correlation and Regression Analysis, with the subtitle
Linear and Curvilinear.

This book is in part an exposition of standard statistical methods,
with no attempt to give proofs or to show their mathematical
derivation. In substantial part, however, it is based upon procedures first
developed by the senior author or by colleagues associated with him. 
In such cases chapter references indicate the professional papers in 
which the methods were first presented and proved and other important 
papers relating to these methods. 

The authors would like to express their deep appreciation to many 
fellow workers in many lands who have contributed to this revision 
by supplying suggestions, criticisms, materials, or illustrations, and 
to the many students who through the years have called attention to 
errors in the previous printings or editions. Our special thanks are 
due to John H. Smith for many helpful suggestions on the entire manu- 
script, to Martha N. Condee for calculating the examples used in 
Chapters 13 and 24, and to J. P. Cavin and R. J. Foote for making 
their computing facilities available for this purpose. We hope readers 
will again call our attention to any new errors of computation or of 
typesetting that may have slipped in in this revision and in the many
new examples introduced. 

In the first edition the senior author acknowledged his debt “to the 
spirit of research with which the Bureau of Agricultural Economics 
was imbued by the broad vision of Henry C. Taylor.” The junior 
author owes a similar debt to the research environment that was
maintained in the Division of Statistical and Historical Research of 
that Bureau under the leadership of O. C. Stine and J. P. Cavin, and 
to such former colleagues there as R. J. Foote, Harold F. Breimyer, 
and C. Kyle Randall, who contributed to its high standards in applied 
research during the current decade. His new colleagues at Iowa 
State University, including T. A. Bancroft, George Snedecor, and 
Emil Jebe, have helped him to appreciate the shift in emphasis from 
correlation to regression (and to the analysis of variance) that has 
taken place in a number of other sciences in addition to economics. 
The two authors are, of course, jointly responsible for the particular 
emphasis adopted in this edition and for such errors and imperfections 
as may exist in it. 

We wish to thank all our helpers in Ames, Rome, and at Wiley & 
Sons in New York, for their part in making this new edition possible. 

Mordecai Ezekiel 
Karl A. Fox 

Rome, Italy 

Ames, Iowa, 

August, 1959 



Preface to First Edition 


This book is not intended to cover the entire field of statistics, but 
rather, as its name indicates, that part of the field which is concerned 
with studying the relations between variables. The first two chapters 
are devoted to a brief review of the central elements in the measure- 
ment of variability in a statistical series, and to the essential concepts 
in judging the reliability of conclusions. These chapters are not to 
be regarded as a full statement, but instead as brief summaries to 
clarify the basic ideas which are involved in the subsequent develop- 
ment. 

No attempt is made in the body of the text to present the mathe- 
matical theory on which the art of statistical analysis is based. In- 
stead, the aim throughout has been to show how the various methods 
may be employed in practical research work, what their limitations 
are, and what the results really mean. Only the simplest of algebraic 
statements have been employed, and the practical procedure for each 
operation has been worked out step by step. It is believed that the 
material will be readily comprehensible to anyone who has had courses 
in elementary algebra. 

Although the examples which are used in presenting the several 
methods are drawn very largely from the author’s own field of agri- 
cultural economics, the methods themselves are explained in suffi- 
ciently general terms so that they can be applied in any field. In 
addition, two chapters are devoted to a discussion of the types of 
problems in a great many different fields of work to which correlation 
analysis has been successfully applied, and to research methods and
the place of correlation analysis in research. It is hoped that this
presentation will assist research workers in many fields to appreciate
both the possibilities and the limitations of correlation analysis, and
so gain from their data knowledge of all the relations which so
frequently lie hidden beneath the surface.

Where the methods presented are the well-established ones developed
by the fathers of the modern science, mainly the English statisticians,
no attempt is made to prove or derive the various formulas.
On a few crucial points, however, or where derivations not generally
accessible are involved, the derivations of the formulas are shown in
notes in the technical appendix in the simplest manner possible.

The methods presented in this book, insofar as they constitute an
advance over those previously available, represent largely the joint
product of a group of young researchers in the Bureau of Agricultural
Economics of the United States Department of Agriculture during
the past decade. The new methods include (a) the application of the
Doolittle method to the solution of multiple correlation problems,
greatly reducing the labor of obtaining multiple correlation results,
and making feasible the use of multiple correlation in actual research
work; (b) the development of approximate methods for determining
curvilinear multiple correlations and, more recently, very rapid
graphic methods for their determination; (c) the recognition of “joint”
correlation, and the gradual development of methods of treating it;
and (d) by extensive use in actual investigations, concrete demonstration
of the possibilities of these methods in research work. These
recent developments in correlation analysis are as yet largely unavailable
except in the original articles in technical journals. One object
of this book is to present them in organized form, and with such
interpretation that their significance and application may be fully
understood.

During the last two decades the English statisticians “Student”
and R. A. Fisher have been developing more exact methods of judging
the reliability of conclusions, particularly where those conclusions
involve correlation or are based on small samples. These new methods
have as yet received but little recognition from American statisticians.
They are presented here as simply as possible, and the discussion of
the reliability of conclusions gives them full consideration.

So many persons have helped in the years during which this book
has been growing that it is difficult for me to enumerate them all.
First of all I should like to mention Howard R. Tolley, from whom
I received my introduction to statistics, and with whom it has been a
constant joy to work. I give him credit for much that is included
here. The very order of presentation reflects that which he worked
out for his classes. In a very real sense this book is a product of the
spirit of research with which the Bureau of Agricultural Economics 
was imbued by the broad vision of Henry C. Taylor. John D. Black 
was the first to point out some of the undeveloped phases of statistical 
analysis, and then aided with encouragement and counsel in their 
solution. Bradford B. Smith aided in the beginning of the new
developments, and his vivid imagination and logical mind have been a
constant help. Among others who have collaborated in various stages, or
who have independently worked out various phases of the problem, 
may be mentioned Sewall Wright, Donald Bruce, Fred Waugh, Louis 
Bean, and Andrew Court. Susie White, Helen L. Lee, and Della E. 
Merrick have given intelligent, conscientious, and loyal assistance in 
the clerical work in the development and testing of each new step. 

In the preparation of the book itself I have had generous and willing 
help. Dorothea Kittredge and Bruce Mudgett have given the very 
substantial assistance of a detailed reading of the entire text, and 
many improvements in presentation and in material are due to their 
suggestions. For two terms the mimeographed manuscript has been 
used as a text in the United States Department of Agriculture Gradu- 
ate School, and the members of the class have helped me in working 
out the illustrations, in clarifying the text, and in eliminating errors. 
R. G. Hainsworth, who prepared the figures, deserves credit for the 
excellence of the graphic illustrations. O. V. Wells helped in com- 
puting many of the illustrative problems, and Corrine F. Kyle in 
verifying the arithmetic. For the laborious and exacting work of 
typing the preliminary stencils, the many revisions, and the final 
manuscript, and for her care, patience, and suggestions, I am indebted 
to my mother, Rachel Brill Ezekiel; and for editing the manuscript 
and helping in the lengthy task of proofreading, to my wife, Lucille 
Finsterwald Ezekiel.

To all these, and to the many others who have helped me in the 
development of this work, I take this opportunity of expressing my 
obligation and my gratitude. 

For any errors in the statements made and in the theories advanced, 
I alone am of course responsible. Although the text has been checked 
painstakingly, it is hardly to be hoped that a publication of this char- 
acter will appear without some errors creeping in, in mathematics, in 
arithmetic, or in spelling. When such errors, or any ambiguities of 
statement, are noted by any reader, I would be very grateful if he 
would inform me of them. 

Mordecai Ezekiel 

Washington, D. C. 

April 20, 1930 



Contents


SECTION I
Introductory Concepts

Chapter  1  Measuring the variability of a statistical series
         2  Judging the reliability of statistical results
         3  The relation between two variables, and the idea of function
         4  Determining the way one variable changes when another
            changes: (1) by the use of averages


SECTION II
Simple Regression, Linear and Curvilinear

Chapter  5  Determining the way one variable changes when another
            changes: (2) according to the straight-line function
         6  Determining the way one variable changes when another
            changes: (3) for curvilinear functions
         7  Measuring accuracy of estimate and degree of correlation
         8  Practical methods for working out two-variable correlation
            and regression problems
         9  Three measures of correlation and regression: the meaning
            and use for each


SECTION III
Multiple Linear Regressions

Chapter 10  Determining multiple linear regressions: (1) by successive
            elimination
        11  Determining multiple regressions: (2) by fitting a linear
            regression equation
        12  Measuring accuracy of estimate and degree of correlation
            for linear multiple regressions
        13  Practical methods for working out multivariable correlation
            and regression problems


SECTION IV
Multiple Curvilinear Regressions

Chapter 14  Determining multiple curvilinear regressions by algebraic
            and graphic methods
        15  Measuring accuracy of estimate and degree of correlation
            for curvilinear multiple regressions
        16  Short-cut graphic methods of determining net regression
            lines and curves


SECTION V
Significance of Correlation and Regression Results

Chapter 17  The sampling significance of correlation and regression
            measures
        18  Influence of selection of sample and accuracy of observation
            on correlation and regression results
        19  Estimating the reliability of an individual forecast
        20  The use of error formulas with time series


SECTION VI
Miscellaneous Special Regression Methods

Chapter 21  Measuring the relation between one variable and two or
            more others operating jointly
        22  Measuring the way a dependent variable changes with
            changes in a qualitative independent variable
        23  Cross-classification and the analysis of variance
        24  Fitting systems of two or more simultaneous equations


SECTION VII
Uses and Philosophy of Correlation and Regression Analysis

Chapter 25  Types of problems to which correlation and regression
            analysis have been applied
        26  Steps in research work, and the place of statistical analysis


Appendix 1  Glossary and important equations
         2  Methods of computation
         3  Technical notes

Author Index

Subject Index



SECTION I 


Introductory Concepts 


CHAPTER 1

Measuring the variability 
of a statistical series 


Statistical analysis is used where the thing to be studied can be reduced 
to or stated in terms of numbers. Not all the undertakings that rely 
on measurements ordinarily employ statistical analyses. In surveying, 
physics, and chemistry, for example, the particular thing being studied 
can usually be measured so closely, and varies over such a small range, that 
the true value can be established within narrow limits. But even in these 
fields, the modern work on atomic subparticles has involved the use of 
statistical concepts. In fact, the statistical concept of true value owes its 
existence to the reproducibility of measurements in fields like these. 

In many natural sciences, the problem to be studied can be simplified 
by the use of controlled experimental conditions, which permit the influence 
of various factors to be studied one at a time. In such sciences, statistical 
methods can be used to plan experiments in such a way as to make the 
conclusions most reliable with a minimum of effort, and they can be used
to measure the interrelations in sciences like astronomy, where the
phenomena can be observed but not controlled.¹

In the social sciences, there are fewer opportunities for the use of 
controlled experiments. Such sciences have to rely on statistical analysis, 
both to judge the importance of observed differences and to untangle 
the separate effects of multiple factors. Statistical analysis is used in the 
study of occurrences where the true value or relation cannot be measured 

1 W. G. Cochran and Gertrude M. Cox, Experimental Designs, 2nd ed., John Wiley 
and Sons, New York, 1957. 

R. A. Fisher, The Design of Experiments, 5th ed., Oliver and Boyd, Edinburgh and 
London, 1949. 

I 



2 Introductory Concepts 

directly or is hidden by other things The numerical statement of the 
occurrence or of the relationship cannot be obtained directly from the 
original or “raw” figures Instead, the data must be analyzed to determine 
the values desired 

The special need for analytical methods in the social sciences has been
clearly stated by an eminent Englishman, as follows:²

Causation in social science is never simple and single as in physics or biology,
but always multiple and complex. It is of course true that one-to-one causation
is an artificial affair, only to be unearthed by isolating phenomena from their
total background. Nonetheless, this method is the most powerful weapon in the
armory of natural science: it disentangles the chaotic field of influence and
reduces it to a series of single causes, each of which can then be given due weight
when the isolates are put back into their natural interrelatedness, or when they
are deliberately combined (as in modern electrical science and its applications)
into new complexes unknown in nature. This method of analysis is impossible in
social science. Multiple causation here is irreducible.

The problem is a two-fold one. In the first place, the human mind is always
looking for single causes for phenomena. The very idea of multiple causation
is not only difficult, but definitely antipathetic. And secondly, even when
the social scientist has overcome this resistance, extreme practical difficulties
remain. Somehow he must disentangle the single causes from the multiple
field of which they form an inseparable part. And for this a new technique is
necessary.

The Arithmetic Average. The basic forms of statistical analysis
concern the organization of quantitative information as a basis for drawing
inferences. Some of the basic work involves averaging and classifying
data. Thus, if a person were studying the yield of corn in one year in
some area, say a county, he might talk with 20 farmers picked at random
and obtain figures, such as those in Table 1.1, showing the yield of corn
each farmer had obtained.³

² Julian Huxley, The science of society, Virginia Quarterly Review, Vol. 16, No. 3,
pp. 348-65, Summer, 1940.

³ “Picked at random” means so selected that, for each observation, there is just as
great a chance of any one farm in the universe (as here, the county) being selected as of
any other farm. One way of making a random selection would be to put slips with the
names of all the farmers in the county into a bowl, mix the slips thoroughly, and then
have a blindfolded person draw out slips one at a time, repeating the mixing before
each new drawing. Data would then be obtained from the n farmers represented by the
slips so drawn. A sample so selected is known as a “random sample” or an “equal
probability sample.” (See V. G. Panse and P. V. Sukhatme, Statistical Methods for
Agricultural Workers, pp. 36-40, Indian Council of Agricultural Research, New Delhi,
1954.) These two terms, “universe,” meaning the whole group of cases about which one
is interested in finding out certain facts, and “sample,” meaning a certain number of
those cases, picked at random or otherwise from all those in the particular universe, are
both used frequently in statistical work, and should be clearly understood.
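
The slips-in-a-bowl procedure can be imitated on a computer. The following is only a minimal sketch in Python: the roster of 500 farmers and the seed are hypothetical, introduced purely for illustration, and the draw is made without replacement, as it is when slips are taken out of the bowl.

```python
import random

# Hypothetical roster: stand-in names for every farmer in the county
# (the "universe"). The size of 500 is an assumption for illustration.
farmers = [f"farmer_{i:03d}" for i in range(1, 501)]

# A fixed seed is used only so that the sketch gives repeatable results.
rng = random.Random(1959)

# random.sample draws without replacement, so each farmer has the same
# chance of selection, like slips drawn from a thoroughly mixed bowl.
sample = rng.sample(farmers, k=20)

print(len(sample), "farmers selected; first five:", sample[:5])
```

Data would then be collected from the 20 farmers named in `sample`.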





Table 1.1

Yields of Corn Obtained by 20 Farmers*


Farmer    Yield      Farmer    Yield      Farmer    Yield      Farmer    Yield
         Bushels              Bushels              Bushels              Bushels
         per acre             per acre             per acre             per acre

   1        39          6        43         11        39         16        43
   2        35          7        36         12        45         17        41
   3        48          8        38         13        36         18        47
   4        40          9        40         14        33         19        38
   5        37         10        39         15        41         20        42


* In making a table such as this, the actual values may be “rounded off” to any 
desired extent. In this case they are rounded to the nearest whole bushel. For example, 
43 bushels represents any report of 42.5 bushels or more, and up to but not including 
43.5 bushels. If the original reports were secured to the nearest tenth bushel, this might 
be indicated by writing 42.5-43.4 instead of 43; or if secured to the nearest hundredth
bushel, by writing 42.50-43.49. 

In performing arithmetic calculations on rounded off data, the results may always 
have a certain range of inaccuracy due to the effects of rounding. See Appendix 3, 
Note 7. 
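
The rounding convention in the table note, under which 43 stands for any report from 42.5 up to but not including 43.5, can be sketched in Python as follows. The function name is hypothetical. Note that Python's built-in `round` sends exact halves to the nearest even number (so `round(42.5)` gives 42), which is why the sketch uses `math.floor` instead.

```python
import math

def round_to_whole_bushel(reported: float) -> int:
    """Round a report to the nearest whole bushel, halves going upward."""
    return math.floor(reported + 0.5)

# Reports of 42.5 up to (but not including) 43.5 are all recorded as 43.
for reported in (42.5, 43.0, 43.4, 43.5):
    print(reported, "->", round_to_whole_bushel(reported))
```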

The most natural first step in reducing such a series of observations to
more usable shape is to find the arithmetic average: to add all the yields
reported and divide by the number of items. The 20 reports total 800
bushels, or an average of 40 bushels.⁴ This provides a single figure into
which one characteristic of the whole group is condensed.

⁴ Bushels are used here to represent any other quantity in which one might be interested
in a particular case. If we let $X$ represent the number of bushels reported by farmer 1,
$X'$ the bushels reported by farmer 2, $X''$ the bushels by farmer 3, and so on, we can then
represent the sum of all the reports by the expression $\Sigma X$ (read “summation of the X's”).
Similarly, if we use $n$ to represent the number of observations we have obtained and use
$M_x$ to represent the average (or mean) number of bushels for all reports, we can define the
arithmetic mean by the formula:

$$M_x = \frac{\Sigma X}{n} \tag{1.1}$$

This formula can be applied to anything we are studying, no matter whether $X$ means
bushels of corn, inches in height, degrees of temperature, grade in a school examination,
distance of a star, height of a flood, or any other measurable quantity; or whether there
are 2 cases or 2 million. This is thus a perfectly general formula which can be applied to
any given problem. As statistics is a study of general methods, so stated that they can be
applied to particular problems as desired, it will be necessary to use many general
formulas of this sort. The student should therefore familiarize himself with the definitions
given above and with the way they are used in formula (1.1), so that he will be able
to understand and use each formula as it occurs.
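
Formula (1.1) can be checked against the 20 reports of Table 1.1 with a few lines of Python. This is a minimal sketch; the variable names are illustrative, and farmer 7's yield is taken as 36 bushels, the value implied by Table 1.2 and by the stated total of 800.

```python
# Yields in bushels per acre for farmers 1 through 20 (Table 1.1).
yields = [39, 35, 48, 40, 37, 43, 36, 38, 40, 39,
          39, 45, 36, 33, 41, 43, 41, 47, 38, 42]

n = len(yields)        # number of observations
total = sum(yields)    # the summation of the X's
mean = total / n       # formula (1.1)

print(n, total, mean)  # prints: 20 800 40.0
```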



But the average is not the only characteristic of the group that might
be of interest. The average would still be 40 if every one of the 20 farmers
had had instead a yield of 40 bushels per acre; yet the mean of 20 reports
each of 40 bushels would certainly be more reliable than the mean of
20 reports ranging from 33 to 48 bushels, even though both did have the
same average.

Classifying the Data. One way of showing the differences in the
individual reports is to arrange them in some regular order. If the farmers
interviewed have simply been visited at random, and not selected so that
those visited first represent one portion of the county and those visited
later another portion, the order in which the records stand has nothing
to do with their meaning. As a first step to seeing just what the data do
show, they can be rearranged in order from smallest to largest, as shown
in Table 1.2.

Table 1.2

Yields of Corn on 20 Farms, Arranged in Order
of Increasing Yields


            Bushels per acre

     33        38        40        43
     35        38        40        43
     36        39        41        45
     36        39        41        47
     37        39        42        48



It is now easier to tell from the series something about the group of
reports. One can now see that only 1 farmer had a yield of less than
35 bushels per acre, and only 2 had more than 45, so that 17 out of the
20 had 35 to 45, inclusive. The series shows, too, that 10 of the farmers
had less than 40 bushels of corn per acre and 10 had 40 or more, so that
the figures 39 and 40 mark the middle of the number of yields reported.
If we divide each half into halves again, we see that 5 men had yields of
37 bushels or less, 5 had yields of 43 bushels or more, whereas 10 men,
half of those reporting, had yields of 38 to 42 bushels, inclusive. This
tells something about how variable yields were from farm to farm in the
area from which the reports were secured: half the reports fell within this
5-bushel range.⁵

⁵ In statistical terminology, the figure that divides the number of reports into
halves (39.5 in this case) is termed the median, and the figures that divide the numbers
into quarters (37.5 and 42.5) are termed the lower and upper quartiles. The difference
between the two quartiles, within which the central half of the reports fall, is termed the
interquartile range.
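
The median and quartiles just described can be reproduced with a short Python sketch. It follows the procedure in the text (divide the ordered reports into halves, then divide each half again); this is one convention among several, and library routines such as `statistics.quantiles` interpolate differently and may give slightly different quartile values. The variable names are illustrative.

```python
from statistics import median

# The 20 yields of Table 1.2, in bushels per acre, sorted ascending.
yields = sorted([33, 35, 36, 36, 37, 38, 38, 39, 39, 39,
                 40, 40, 41, 41, 42, 43, 43, 45, 47, 48])

half = len(yields) // 2
lower_half, upper_half = yields[:half], yields[half:]

med = median(yields)            # midway between the 10th and 11th reports
lower_q = median(lower_half)    # midway between the 5th and 6th reports
upper_q = median(upper_half)    # midway between the 15th and 16th reports
iqr = upper_q - lower_q         # interquartile range

print(med, lower_q, upper_q, iqr)  # prints: 39.5 37.5 42.5 5.0
```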




Even as rearranged in Table 1.2, the 20 reports still constitute a large 
tabulation. If there were several hundred, such a listing would be so 
unwieldy that it would be difficult to use. 

Frequency Tables. The records can be studied more easily if, instead 
of writing “39” three times when there are 3 farmers with 39 bushels 
each, we simply show that each of 3 men reported 39 bushels. Similarly, 
instead of putting “40” down twice, we can show that 40 bushels were 
reported by 2 men. If this operation is performed for all the reports, the 
data can then be assembled into what is known as a “frequency table.” 
The result is shown in Table 1.3. It gives the frequency, that is, the number
of times that each yield of corn was reported.


Table 1.3

Frequency Table, Showing Number of Times Each Yield
Was Reported, by Individual Bushels


Yield of Corn   Number of Times     Yield of Corn   Number of Times
                   Reported                            Reported
   Bushels                             Bushels

     33               1                  41               2
     34               0                  42               1
     35               1                  43               2
     36               2                  44               0
     37               1                  45               1
     38               2                  46               0
     39               3                  47               1
     40               2                  48               1



In preparing such a frequency table, spaces are put in for all yields 
(such as 34 bushels) for which no reports were received, but which lie 
between the largest and the smallest report, to show clearly that no such 
yields were reported. 
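
A frequency table of this kind, including the zero entries for yields that nobody reported, can be built with a minimal Python sketch such as the following. The variable names are illustrative only.

```python
from collections import Counter

# The 20 reported yields, in bushels per acre.
yields = [33, 35, 36, 36, 37, 38, 38, 39, 39, 39,
          40, 40, 41, 41, 42, 43, 43, 45, 47, 48]

counts = Counter(yields)

# Walk every whole bushel from the smallest to the largest report, so
# that yields with no reports (such as 34) appear with a frequency of 0.
# A Counter returns 0 for keys it has never seen.
frequency_table = {b: counts[b] for b in range(min(yields), max(yields) + 1)}

for bushels, times in frequency_table.items():
    print(bushels, times)
```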

Table 1.3 is an improvement on Table 1.2, but it is still pretty long — and 
if the lowest yield had happened to be 25, say, and the highest 70, it would 
have been longer still. For that reason it is frequently desirable to group 
the reports, not only for a yield of a specified number of bushels, but also 
for yields within a certain range of bushels. This is illustrated in Table 1.4, 
which gives the number of reports for groups covering 3 bushels. 

Table 1.4

Frequency Table, Showing Number of Times Each Yield Was
Reported, by 3-Bushel Groups


Yield of Corn    Number of Times
                    Reported
   Bushels

  32.5-35.4            2
  35.5-38.4            5
  38.5-41.4            7
  41.5-44.4            3
  44.5-47.4            2
  47.5-50.4            1


The presentation is now condensed enough so that it can be readily
understood. It is easy to see that most of the reports fell around 35.5
to 44.4 bushels and that more fell near 40 bushels than anywhere else.
Of course, the 3-bushel group is purely arbitrary, and any other convenient
“class interval,” as it is called in statistical terminology, could have been
used. Thus, if a 5-bushel class interval had been selected, the convenient
groups 29.5-34.4, 34.5-39.4, 39.5-44.4, and 44.5-49.4 bushels could have
been established, giving frequencies of 1, 9, 7, and 3 for the four groups.
Just what class interval makes the most satisfactory table for any given
set of data depends upon how the data run and how much detail it is
desired to show.⁶
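
The effect of choosing a different class interval can be tried out with a short Python sketch. The helper name `grouped_frequencies` and its arguments are hypothetical; it assumes that the lowest class boundary lies at or below the smallest report.

```python
def grouped_frequencies(values, lowest, width):
    """Count values falling in consecutive classes of the given width.

    Class k runs from lowest + k*width up to, but not including,
    lowest + (k+1)*width. Assumes lowest <= min(values).
    """
    n_classes = int((max(values) - lowest) // width) + 1
    counts = [0] * n_classes
    for v in values:
        counts[int((v - lowest) // width)] += 1
    return [(lowest + k * width, counts[k]) for k in range(n_classes)]

yields = [33, 35, 36, 36, 37, 38, 38, 39, 39, 39,
          40, 40, 41, 41, 42, 43, 43, 45, 47, 48]

# 3-bushel classes starting at 32.5, and 5-bushel classes starting at 29.5.
for lowest, width in ((32.5, 3), (29.5, 5)):
    print(f"{width}-bushel class interval")
    for lower, count in grouped_frequencies(yields, lowest, width):
        print(f"  {lower:.1f}-{lower + width - 0.1:.1f}  {count}")
```

The 3-bushel grouping gives frequencies of 2, 5, 7, 3, 2, and 1, and the 5-bushel grouping gives 1, 9, 7, and 3, in agreement with the figures in the text.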

Measures of Deviation 

The Mean Deviation. Table 1.4 shows, in fairly compact form, the
way that the several individual reports fall on each side of the average
value. For some uses, however, it is desirable to have a single figure
which expresses the variation or “scatteration” of the whole group of
reports, in just the same way that the arithmetic mean expresses the
average yield of the whole group.

⁶ Where there is a tendency for the reports to be grouped around certain values, such
as 5 or 10, it is desirable to take the class intervals so as to make these values fall in the
middle of the groups. Thus, with a concentration on whole numbers ending in 5 or 0,
the groups 2.5-7.4, 7.5-12.4, 12.5-17.4, etc., may be used.

One way in which the tendency of the group to scatter either far from
or close to the mean can be measured is by finding out how far, on the
average, each report lies from the mean. Table 1.5 illustrates the way in
which this can be done.


Table 1.5

Computation of Mean Deviation

Original Report    Mean         Report Minus the Mean
(bushels)          (bushels)    (bushels)
39                 40           -1
35                 40           -5
48                 40            8
40                 40            0
37                 40           -3
...*               ...          ...
Total                           60†

* The remaining 15 reports are not shown in this table, though included in the total.
† The plus and minus signs are disregarded in making this total.


\[ \text{Mean deviation} = \frac{60 \text{ bushels}}{20} = 3 \text{ bushels} \]


In computing the mean deviation, the plus and minus signs are disregarded
in adding up the individual differences from the mean.⁷

The new figure, 3 bushels, is the mean deviation of all the reports. It

⁷ Before writing the general formula for the mean deviation it is first necessary to have
some way of writing any deviation. Using X to indicate any given report, as before, and
\(M_x\) to indicate the arithmetic average of all such reports, the small x will be used to
indicate the deviation of each report from the mean of all, thus:

\[ X - M_x = x \tag{1.2} \]

\[ X' - M_x = x' \]

\[ X'' - M_x = x'' \]

and so on.

Paralleling the previous notation, \(\Sigma|x|\) (read “summation of all the small x's”) is
used to indicate the sum of the values such as x, x', x'', etc., taken without regard to
sign.

The mean deviation is then defined:

\[ \text{Mean deviation} = \frac{\Sigma|x|}{n} \tag{1.3} \]

It is necessary to disregard the signs in taking this sum, as otherwise the sum would be
zero.




shows that the 20 individual reports differed from the mean yield of
40 bushels by an average of 3 bushels each. This furnishes a single figure
which expresses how much or how little the individual yields differed
from the average yield. If the group of 20 reports were being compared
with another group of 20, all of 40 bushels each, the mean deviations of
the two sets would indicate at once a striking difference in their make-up.
The second set would have a mean deviation of 0, as compared to the
3-bushel mean deviation for the first set.

Whereas the arithmetic average is a measure of the central tendency
of a group of reports, the mean deviation is instead a measure of the
“scatteration” of the individual reports, that is, of their tendency to lie near to
or far from the central value.
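The mean deviation described above can be sketched in a few lines. The 20 yields used here are hypothetical whole-bushel values, chosen so that their mean is 40 bushels and their absolute deviations total 60; they are not the book's original reports, and `mean_deviation` is simply a name picked for this sketch.

```python
# Hypothetical whole-bushel yields (NOT the book's Table 1.3); chosen so that
# the mean is 40 bushels and the absolute deviations sum to 60.
YIELDS = [33, 35, 36, 36, 37, 38, 38, 39, 39, 39,
          40, 40, 41, 41, 42, 43, 43, 45, 47, 48]


def mean_deviation(values):
    """Average of the deviations from the arithmetic mean, signs disregarded."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)


# The comparison group mentioned in the text: 20 reports of 40 bushels each.
uniform_group = [40] * 20

print("mean yield:", sum(YIELDS) / len(YIELDS))
print("mean deviation:", mean_deviation(YIELDS))
print("mean deviation of the all-40 group:", mean_deviation(uniform_group))
```

The two groups share the same mean, so only the mean deviation tells them apart.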

The Standard Deviation. The variation of a group of reports around
the mean of the group can also be measured by another statistic which has
certain advantages from a mathematical point of view. (A statistic is
the value of a given coefficient computed from a sample.) This measure
is based on the deviation of each report from the mean, just as is the mean
deviation. After the individual deviations are computed, each one is
squared, and the squares are then added. This process is shown in Table 1.6.
The sum of the squared deviations is then divided by the number of
items included in the group, and the square root of the result is computed.
The computation is as follows:⁸

\[ \frac{288}{20} = 14.4 \]

\[ \text{Standard deviation} = \sqrt{14.4} = 3.79 \text{ bushels} \]

⁸ The letter s is used as the sign for the standard deviation computed from a sample.
Using x to represent individual differences from the mean as before, \(x^2\) for the square of
each of such deviations, and \(\Sigma x^2\) for the sum of all such values, the standard deviation is
defined mathematically by the formula

\[ s_x = \sqrt{\frac{\Sigma x^2}{n}} \tag{1.4} \]

Where the arithmetic average is a fraction, so that computing each individual deviation
and squaring it would take much arithmetic for accurate work, the standard deviation
may be computed more easily by the following formula:

\[ s_x = \sqrt{\frac{\Sigma X^2}{n} - M_x^2} \tag{1.5} \]

Here the original X values are squared instead of the deviations from the mean, or x,
values. It can be readily demonstrated algebraically that the two formulas give identical
values for \(s_x\).




The new value, 3.79 bushels, is called the standard deviation.⁹ (It is
sometimes called the root-mean-square deviation, because it is the square
root of the mean of the squares of the individual deviations.) It is somewhat
larger than the average deviation. That is a relation which usually
holds: the process of squaring the deviations and taking the square root
of the total tends to emphasize the largest ones more than does merely

Table 1.6

Computation of Sum of Squared Deviations from the Mean

Original Report    Mean         Report Minus the Mean    Deviations Squared
                                (= Deviation)
(bushels)          (bushels)    (bushels)                (bushels)
39                 40           -1                       1
35                 40           -5                       25
48                 40            8                       64
40                 40            0                       0
37                 40           -3                       9
Total                                                    288*

* The remaining 15 reports are not shown in this table, though they are included in
the total.

averaging their actual values. If the distribution of observations is
“normal” or nearly “normal,” the standard deviation is about one and a
quarter times as large as the average deviation.¹⁰ Figure 1.1 shows the
normal distribution.
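The standard deviation can be checked numerically in both of the ways the footnote describes: from the squared deviations and from the squares of the original values. The sketch below does so on a hypothetical set of 20 whole-bushel yields (not the book's reports) chosen so that the squared deviations sum to 288. It also compares the ratio of standard deviation to mean deviation with the value for a normal curve, which is the square root of pi/2, or about 1.25. All function names are assumptions of this sketch.

```python
import math

# Hypothetical whole-bushel yields (NOT the book's Table 1.3); chosen so that
# the mean is 40 and the squared deviations from the mean sum to 288.
YIELDS = [33, 35, 36, 36, 37, 38, 38, 39, 39, 39,
          40, 40, 41, 41, 42, 43, 43, 45, 47, 48]


def std_dev_from_deviations(values):
    """Square root of the mean of the squared deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / n)


def std_dev_from_raw_values(values):
    """Same quantity, computed from the squares of the original values."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum(v * v for v in values) / n - mean ** 2)


def mean_deviation(values):
    """Average absolute deviation from the mean."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)


s_dev = std_dev_from_deviations(YIELDS)
s_raw = std_dev_from_raw_values(YIELDS)
ratio = s_dev / mean_deviation(YIELDS)
normal_ratio = math.sqrt(math.pi / 2)  # theoretical ratio for a normal curve

print(f"standard deviation (deviation form): {s_dev:.2f}")
print(f"standard deviation (raw-value form): {s_raw:.2f}")
print(f"ratio to mean deviation: {ratio:.3f} (normal curve: {normal_ratio:.3f})")
```

The raw-value form avoids computing each deviation, which is the arithmetic saving the footnote has in mind when the mean is a fraction.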

⁹ For a shorter method of computing the standard deviation when there is a large
number of observations, and the data are grouped in a frequency table like Table 1.4,
the following equation can be used:

\[ s = \sqrt{\frac{\Sigma(d^2F)}{n} - \left(\frac{\Sigma(dF)}{n}\right)^2 - \frac{c^2}{12}} \]

In this equation, d represents the departure of each group from a selected central group,
F represents the number of cases (frequency) of reports in each group, and c represents
the number of units in each group-interval. For the details of using this formula, see
Note 1.1 at end of this chapter.

¹⁰ A “normal distribution” is one such as that obtained from a series of observations
of a variable influenced only by a large number of random or chance causes, each one
small in proportion to the total. Thus the distribution of values secured by tossing a
number of dice, and noting the spots at each reading, tends to conform to a “normal
curve.” Variables composed of a large number of small, independent elements also tend
to have a normal distribution. Since this distribution can be studied mathematically,
it is possible to work out theoretically many of its properties. These theoretical
characteristics of the normal curve are valuable in studying data where the distributions
are nearly normal.





The distribution of the observations shown in Table 1.3 is fairly
symmetrical about the mean. The reports are concentrated around the
middle values and then thin out toward the extreme values (that is, the
distribution approximates normality). In such cases the standard
deviation gives a measure of the range within which a fairly definite
proportion of the reports will be included. In a normal distribution, the
interval from the distance of one standard deviation below the mean to


Fig. 1.1. This shows the shape of the normal frequency distribution.


the distance of one standard deviation above the mean will include about
68 per cent of the observations. In this particular case the mean is 40.0
bushels, and the standard deviation is 3.79 bushels, so the interval will be
from 3.8 less than 40.0, or 36.2, to 3.8 more than 40.0, or 43.8. Comparing
this with Table 1.3, we find that 4 farmers reported 36.4 or less, and 3
reported 43.5 or more. The interval 36.5 to 43.4 thus included 13 out of
the 20 cases, or 65 per cent. With only 20 observations, this comes
reasonably close to the 68 per cent that would characterize a normal
distribution.
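The comparison just made can be reproduced as a small check. The sketch counts how many reports lie within one standard deviation of the mean and sets that share beside the theoretical share for a normal curve, obtained from the error function. The 20 yields are hypothetical stand-ins for Table 1.3, chosen to agree with the counts quoted above (4 reports at 36.4 or less, 3 at 43.5 or more).

```python
import math

# Hypothetical whole-bushel yields (NOT the book's Table 1.3); chosen to agree
# with the counts quoted in the text.
YIELDS = [33, 35, 36, 36, 37, 38, 38, 39, 39, 39,
          40, 40, 41, 41, 42, 43, 43, 45, 47, 48]

MEAN = 40.0      # bushels, as given in the text
STD_DEV = 3.79   # bushels, as given in the text

lower, upper = MEAN - STD_DEV, MEAN + STD_DEV
inside = [y for y in YIELDS if lower <= y <= upper]
observed_share = len(inside) / len(YIELDS)

# Share of a normal distribution lying within one standard deviation of its mean.
normal_share = math.erf(1 / math.sqrt(2))

print(f"interval: {lower:.1f} to {upper:.1f} bushels")
print(f"observed share inside: {observed_share:.0%}")
print(f"normal-curve share:    {normal_share:.1%}")
```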

For some uses, the square of the standard deviation has advantages
over the standard deviation itself. Just as the standard deviation, 3.79
bushels in this case, may be thought of as measuring “variability,” the
standard deviation squared, 14.4, may be thought of as measuring “average
squared variability,” or variance.

The relation of the three measures which have been discussed in this
chapter (the mean, the average deviation, and the standard deviation)
is illustrated graphically in Figure 1.2. Here the frequency distribution
shown in Table 1.4 has been charted, showing the yield in bushels of corn
along the bottom of the chart and the number of reports falling in each
group along the sides.¹¹

Besides showing the number of reports included in each 3-bushel group
by the height of the continuous line, the chart indicates the position of the
mean in about the center of the group of reports, and likewise the number
of reports included within a range of both one average deviation and
one standard deviation on each side of the mean.


[Figure 1.2: frequency chart. Horizontal axis: yield of corn (bushels per acre), marked from 30 to 54. Vertical axis: number of reports.]

Fig. 1.2. Frequency distribution of corn yields, and range above and below
the mean included by average and standard deviations.


If a similar chart were made for a very large number of observations
of a variable normally distributed, it would have the shape shown in
Figure 1.1.

Statistics and Parameters. This chapter has defined three measures
of a series of data: the arithmetic average or mean, the average deviation,
and the standard deviation. These were obtained from a sample of cases
(yields on individual farms) selected at random from all the cases in the
given universe (the county). If records were collected for the entire
population of the universe (all the farms in the county), as in a census,
only one fixed value could be obtained for each of these measures. These

¹¹ Mathematically, the quantities which are measured from left to right and shown
along the bottom of the chart, as the bushels of corn are here, are called the abscissas,
whereas the quantities which are measured from bottom to top and shown along the
sides, as the number of reports are here, are called the ordinates. Any point on the
chart can be located by telling how far it is from the left side and from the
bottom; these two items tell exactly where any particular point in the figure should fall.
Thus, the line for the group from 38.5 to 41.5 bushels has an ordinate of 7
farms, and the abscissas of the ends of the line are 38.5 and 41.5 bushels. The ordinate
and abscissa, taken together, are called the coordinates of a point.





true fixed values for the entire universe are designated parameters. The
values calculated from a sample drawn from the universe can only be
estimates or approximations of the true parameters. Such estimates
are called statistics. To make this difference clear, we will use a different
notation for the values calculated from a sample and for the fixed (and
generally unknown) true values for the universe, as follows:



Measure                         Statistic from the Sample    Parameter for the Universe
Arithmetic average (or mean)    \(M_x\)                      \(\mu_x\)
Standard deviation              \(s_x\)                      \(\sigma_x\)

Latin letters are used to represent sample values, and corresponding
Greek letters are used to represent the parameters in the universe from
which the sample was drawn.


Summary 

This chapter has shown (1) how a series of measurements of any
variable, such as the yield of corn from farm to farm, can be classified into
a frequency distribution that shows how the individual reports are distributed
from high to low, (2) how an arithmetic average may be computed
that shows the value around which all the reports center, and (3) how the
variation of the individual reports from the average may be summarized
by computing the average deviation or the standard deviation, which
serve as indicators of the variability of the items included in the particular
series. Although these measures, especially the arithmetic average, are
frequently of value for themselves alone, they are discussed here because it
is necessary to know how they are computed and what they mean before
the next proposition to be discussed can be fully understood.


Note 1.1. Where the number of observations is large, the standard deviation can
be computed more readily from a grouped frequency table than from the individual
items. This process is illustrated in the tabulation that follows.
The standard deviation is then calculated from the grouped data by the formula

\[ s = \sqrt{\frac{\Sigma(d^2F)}{n} - \left(\frac{\Sigma(dF)}{n}\right)^2 - \frac{c^2}{12}} \tag{1.6} \]

Substituting the values shown in the tabulation,

\[ s = \sqrt{\frac{33}{20} - \left(\frac{1}{20}\right)^2 - \frac{1}{12}} = 1.25 \]

In making this computation, any convenient group may be selected as the assumed


                 Number of     Deviation from      Extensions
Yield            Reports,      Assumed Mean,
                 F             d                   dF      d²F
32.5-35.4        2             -2                  -4      8
35.5-38.4        5             -1                  -5      5
38.5-41.4        7              0                   0      0
41.5-44.4        3             +1                   3      3
44.5-47.4        2             +2                   4      8
47.5-50.4        1             +3                   3      9
Sums             20                                 1      33


mean, and the deviations of the other groups (d), in class-interval units, calculated as
departures from it. This method assumes that all the cases in each group fall at the
center of the group. With most variables having a tendency toward a normal distribution,
the average of the items in each group will fall somewhat nearer the center of the
distribution than the midpoint of the group, so the use of this method tends to give too
large a value for the standard deviation. The correction \(-c^2/12\) (called “Sheppard's
correction” after its originator) makes an approximate allowance for this tendency.
The c of the formula stands for the number of units of d in each class interval. Where a
unit of 1 is used for each class interval, as in this problem, the correction becomes simply
\(-1/12\), to be applied to \(s^2\). In this case, failure to use Sheppard's correction would
increase \(s^2\) by 5%. If a larger number of groups is used, the effect would be less, and
therefore this correction is usually ignored in practice.

In computing the standard deviation from a grouped frequency table, the s calculated
will be in terms of the units in which d is expressed. In the illustration, each unit in d
(one class interval) represents 3 units in X, since the yields were grouped in 3-bushel
classes. The standard deviation computed in terms of class intervals, \(s_d\), is therefore
only one-third as large as is the standard deviation in terms of X. The latter may be
calculated from the former by multiplying \(s_d\) by the number of units in each group.
That is, \(s_x\) = (units of X per class interval) \(s_d\). In this problem, 3(1.25) = 3.75.

The resulting value, 3.75, found by the short-cut method, is seen to be nearly the
same as the exact value of 3.79 bushels, previously found by the longer method. The
greater the number of cases in the sample, and the more nearly normal the distribution,
the more time will the short-cut method save, and the more nearly will its approximate
result agree with the exact value found by the longer method.
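The short-cut computation of this note can be sketched as follows, using the frequencies and deviations of the tabulation above. The function name `grouped_variance` and its parameters are assumptions of this sketch; the arithmetic follows the grouped formula with Sheppard's correction taken as \(c^2/12\).

```python
import math

# (deviation d from the assumed mean, in class-interval units; frequency F),
# taken from the tabulation in Note 1.1.
GROUPS = [(-2, 2), (-1, 5), (0, 7), (1, 3), (2, 2), (3, 1)]
CLASS_WIDTH = 3  # bushels of yield per class interval


def grouped_variance(groups, c=1, sheppard=True):
    """Variance in units of d, optionally with Sheppard's correction c^2/12."""
    n = sum(f for _, f in groups)
    sum_df = sum(d * f for d, f in groups)
    sum_d2f = sum(d * d * f for d, f in groups)
    variance = sum_d2f / n - (sum_df / n) ** 2
    if sheppard:
        variance -= c ** 2 / 12
    return variance


var_corrected = grouped_variance(GROUPS)
var_uncorrected = grouped_variance(GROUPS, sheppard=False)

s_class_units = math.sqrt(var_corrected)   # in class intervals
s_bushels = CLASS_WIDTH * s_class_units    # converted to bushels

print(f"s in class intervals: {s_class_units:.2f}")
print(f"s in bushels:         {s_bushels:.2f}")
print(f"excess of uncorrected variance: {var_uncorrected / var_corrected - 1:.0%}")
```

Keeping d in class-interval units and converting to bushels only at the end mirrors the order of steps in the note.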


CHAPTER 2 


Judging the reliability of 
statistical results 


Almost without exception, the object of a statistical study is to furnish
a basis for generalization. In a case like that discussed in the preceding
chapter, for example, no one would be likely to visit 20 farms scattered
all over a county simply for the purpose of finding out what the yield of
corn was on those particular farms. Instead, he might be studying the
yield on those farms as a basis for determining what the average yield of
corn was for all the farms in the county. Stated in statistical terms, he
would be finding out what was the average yield in a sample of farms,
picked at random, with a view to judging what was about the average
yield in the universe in which he was interested, that is, all the farms in the
county.

Of course it would be possible to visit all the farmers in the county,
find out exactly what yield each one obtained, and so get an average
of all the yields in the whole county. But this process would not only
be expensive but also in most cases would be a pure waste of time and
energy. We need only take a large enough sample by a well-designed
sampling method to satisfy ourselves to any desired degree of confidence
concerning the actual average for all the farms of the county. In this
case, 100 records may enable one to determine the average yield quite as
accurately as is necessary. Obtaining records from all the several thousand
farmers in the county might add nothing to the usefulness of the results.

Before considering ways of finding out how many records would be
needed in any given case, we might well discuss a little more fully what
the process of statistical inference involves. Really, all that we do is to
examine or measure a certain group of objects, and infer from the size
or measurement of those objects, or from the way those objects behave,
what will be the size of other objects of the same sort, or how other objects
of the same kind will behave. Thus, statistical inference is a special case
of the logical process of induction, whereby we reason from particular
facts about particular objects to general conclusions as to what will be
the facts for all objects of a given class. Now of course we do not really
know what the particular facts are for any particular object without
actually examining that individual object. All that we can do is to separate
certain groups of objects we know to be alike in one or more particulars,
and then assume that they will be alike in other particulars too, even
though we do not examine every one to prove it.

Assumptions in Sampling. The basic assumptions upon which the 
theory of sampling rests apply both to the way the sample is obtained and 
to the material being sampled. With respect to the material sampled, 
the assumption is that there is a large universe of items subject to more 
or less uniform conditions, in that throughout the universe the individual 
items vary among themselves in response to the same causes and with 
about the same variability. With respect to the selection of sample, the 
values must be so selected (1) that there will not be any relation between 
the size of successive observations, that is, that the chances of a high 
observation being followed by another high observation will be just the 
same as of a low or a medium observation being followed by a high 
observation; (2) that the successive items in the sample are not definitely
selected from different portions of the universe in regular order, but are
simply picked at random so that the chance of the occurrence of any
particular value is the same with each successive observation in the sample;
and (3) that the sample is not picked all from one portion of the universe,
but that the observations are scattered through the universe by purely
chance selection.¹ Where these assumptions are fulfilled, the sample is
designated a “random sample,” and its reliability can be estimated by 
the methods now to be described. 

Taking up the question of how reliable a statistical average really
is, we must first consider, “What is the meaning of reliable?” If we are
interested in corn yield, for example, it is obvious that a perfectly reliable
sample would be one whose average agreed exactly with the average yield
in the county. But if we are interested in knowing the average yield to

¹ Where the items are so selected as to represent different portions of the universe, it
may be called a “stratified sample”; where they are all selected from one portion of the
universe, it may be called a “spot” sample.

Where the universe is not completely uniform, a stratified sample tends to be more
reliable than a random sample, and a spot sample tends to be less reliable than a random
sample. See G. U. Yule and M. G. Kendall, Introduction to the Theory of Statistics,
14th ed., pp. 533-539, and P. V. Sukhatme, Sampling Theory of Surveys with Applications,
pp. 83-137, Iowa State College Press, Ames, and Indian Society of Agricultural Statistics,
New Delhi, 1953, for formulas as to the reliability of stratified samples.




within 1 bushel, then for that purpose the sample would be sufficiently
reliable if its average was almost certain to come within 1 bushel of the
average for the whole county.

Variations in Successive Samples. Suppose that 20 farms had been
visited at random, with the results presented in Chapter 1. If we wanted
to find out how near we could expect the average from that sample to
come to the average for the county as a whole, we might try taking another
sample, visiting 20 other farms at random and getting the average yield
for those 20. If the average yield of the second sample differed from the
average of the first sample by, say, 3 bushels, we should know that both
could not come within 1 bushel of the true average; if, however, the
average of the second sample came within a half bushel of the first average,
we should be inclined to place more confidence in both of them. If we
repeated the process several times over, and all the different samples gave
averages falling within 1 bushel of each other, say between 39.0 and
40.0 bushels, then we should feel pretty certain that the average yield
for the county as a whole was about 39.5 bushels.

Let us suppose that 15 more samples had been taken, each of 20 farms
selected at random, and that when we tabulate the 16 averages from
the 16 different samples we have the 16 values shown in Table 2.1.


Table 2.1

Average Yield of Corn in One County, as Determined
by 16 Different Samples of 20 Farms Each

Sample    Yield                 Sample    Yield
          (bushels per acre)              (bushels per acre)
1         40.0                  9         40.3
2         37.5                  10        38.9
3         39.3                  11        39.3
4         40.6                  12        38.0
5         39.8                  13        39.2
6         41.1                  14        40.9
7         38.3                  15        39.1
8         39.6                  16        40.4


Although the 16 averages range all the way from 37.5 bushels for the
smallest to 41.1 bushels for the largest, we can see that most of them
fall around 39 or 40 bushels. This is even more evident when we arrange
the 16 reports in a frequency table, as shown in Table 2.2.






Although there is a tendency for the averages to cluster around
39 and 40 bushels, still there are several below 38.5 and several above 40.5.
The average for the whole group is 39.5 bushels, and the standard deviation
is 1.00 bushel.

The fact that the standard deviation of the group of averages is 1 bushel
tells us one thing about the way they scatter, from what we already know
about the meaning of the standard deviation. It tells us that about 68 per cent


Table 2.2

Frequency Table Showing the Number of Times Various Average Yields
Were Obtained out of 16 Samples, by One-Half-Bushel Groups

Yield of Corn    Number of Averages    Yield of Corn    Number of Averages
(bushels)        in Group              (bushels)        in Group
37.5-37.9        1                     39.5-39.9        2
38.0-38.4        2                     40.0-40.4        3
38.5-38.9        1                     40.5-40.9        2
39.0-39.4        4                     41.0-41.4        1


of them will fall in the range between one standard deviation below the 
mean of all the averages and one standard deviation above the mean. 
In this particular case, the mean is 39.5 bushels, and the standard deviation 
is 1 bushel, so the interval of one standard deviation above and below the 
mean covers the range from approximately 38.5 bushels to 40.5 bushels. 
Checking this against the array of averages shown in Table 2.2, we find 
that this interval does include 10 out of the 16 cases, or fairly close to 
the proportion expected. 
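The figures quoted for the 16 averages can be verified directly. The sketch below takes the 16 yields as read from Table 2.1, computes their mean and standard deviation (dividing by the number of items, as in Chapter 1), and counts how many fall within one standard deviation of the mean. The variable names are assumptions of this sketch.

```python
import math

# The 16 sample averages as read from Table 2.1 (bushels per acre).
AVERAGES = [40.0, 37.5, 39.3, 40.6, 39.8, 41.1, 38.3, 39.6,
            40.3, 38.9, 39.3, 38.0, 39.2, 40.9, 39.1, 40.4]

n = len(AVERAGES)
mean = sum(AVERAGES) / n

# Standard deviation of the averages, dividing by n as in Chapter 1.
sd = math.sqrt(sum((a - mean) ** 2 for a in AVERAGES) / n)

# Averages lying within one standard deviation of the mean of all the averages.
within = [a for a in AVERAGES if mean - sd <= a <= mean + sd]

print(f"mean of the averages: {mean:.1f} bushels")
print(f"standard deviation:   {sd:.2f} bushel")
print(f"within one standard deviation: {len(within)} of {n}")
```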

Now let us go back to our single original average of 40 bushels, based 
on visits to the original 20 farms. What we want to know is how reliable 
that one average is. Stated another way, how much is that average likely
to be changed if the study were made over again — if another sample of 
the same size were taken? 

In Tables 2.1 and 2.2 we have seen how it might actually work out if
we did do the study over several times. We have seen that, if the new
averages did fall as shown in those tables, two-thirds of the new averages
would fall within an interval of 2 bushels. Furthermore, those figures
showed that all the different averages fell within a range of 4 bushels
(37.5-41.5). But those conclusions were obtained only after getting 15
more samples of 20 cases each, and making 15 new averages, one for each
sample. Is there any way to find out how much the original average is
likely to vary from the true average without going to all the work of taking
a number of new samples?

Estimating the Reliability of a Sample 

If we could estimate the extent to which the averages from new samples
would be likely to vary without ever getting the new samples, then we should
know something more about how much faith we could put in the particular
average that we had already. For example, if in the present case we knew
that if we did go out and get a large number of new averages (such as
those shown in Tables 2.1 and 2.2), those new averages would have a
standard deviation of 1 bushel, this fact would tell us at once something
about how much our one average was likely to be different from the real
average on all the farms. For example, we should know that about 68
per cent of the sample averages would lie in an interval of 2 bushels (one
standard deviation on each side of the mean of the samples). The one
particular average that we had obtained might be any one of all those in
a distribution like that shown in Table 2.2. If we assume that the mean
of a large number of samples would coincide with the true average, then
the chances would be about 68 out of 100 that our average was one of the
averages falling within 1 bushel of the true mean. If, on the other hand,
we knew that the standard deviation of a group of new averages would
probably be, say, 5 bushels, then we should know that we had only about
2 chances out of 3 of the mean of any one sample of 20 cases coming
within 5 bushels of the true average. Obviously, when an average has
2 chances out of 3 of coming within 1 bushel of the true average, it is
much more reliable than if it had 2 chances out of 3 of coming within
5 bushels of the true average.

Whether we can judge how much confidence we can place in a given
average depends, therefore, on whether we can tell what would be the
standard deviation of a number of similar averages, computed from
random samples of the same number of items drawn from the same
universe. If we could tell exactly what that standard deviation would
be, we should know how much faith we could put in the average we had;
we should know what the chances were of its being changed by more
than a given amount if the study were made over, or if a very large sample
were taken. Even if we did not know exactly what the standard deviation
of the whole group of similar averages would be, it would be some help if
we knew approximately what it would be, or if we had a minimum or
maximum value for its size, so that there would be some measure of how
much trust to place in the particular average.




Computing the Standard Error. Fortunately, it is possible to estimate 
with some degree of accuracy what the standard deviation of a whole 
series of averages is likely to be, if each average is computed from a sample 
of the same size and drawn at random from the same universe. Even so, 
the ability to make such an estimate is a tremendous aid to statistical 
investigators, for it affords some check on the dependability of results 
without going to the expense that would be involved in repeating every 
sample 15, 20, or more times, to make sure that a reliable result had been 
obtained. 

The method for computing the estimated standard deviation of the
average involves just two values. These are (1) the standard deviation
of the items in the universe from which the sample was drawn; and (2)
the number of items in the sample. We do not know the standard deviation
of the items in the universe, however, and can only estimate it from
the standard deviation of the items in the sample. It has been determined
that an unbiased estimate of the standard deviation in the universe can
be made by adjusting the standard deviation observed in the sample as
follows:²

\[ \text{Estimated standard deviation of the universe} = (\text{standard deviation of the sample}) \sqrt{\frac{n}{n-1}} \]

In the sample considered in Chapter 1,

\[ \text{Estimated standard deviation} = 3.79 \sqrt{\frac{20}{19}} = (3.79)(1.026) = 3.89 \]



² Using the symbol \(s_x\) as before to mean the standard deviation observed in the
sample and \(\bar{s}_x\) to represent the estimated standard deviation in the universe from which
the sample was drawn, we can define the sample value adjusted for bias as

\[ \bar{s}_x = s_x \sqrt{\frac{n}{n-1}} \tag{2.1} \]

It can more readily be computed by the equation

\[ \bar{s}_x = \sqrt{\frac{\Sigma x^2}{n-1}} \tag{2.2} \]

Actually, it is the value of the estimate of \(\sigma^2\) which is unbiased; and \(\bar{s}_x\) is the square root
of that unbiased estimate.

The two equations are identical, as may readily be proved by combining equations
(2.1) and (1.4).

When equation (1.5) is used, \(\bar{s}_x\) may be computed

\[ \bar{s}_x = \sqrt{\frac{\Sigma X^2 - nM_x^2}{n-1}} \tag{2.3} \]

The mean from a sample has no bias, so the value adjusted for bias is identical with the
sample value, M.




The standard deviation of the group of averages may next be estimated
by dividing the estimated standard deviation in the universe by the square
root of the number of cases in the sample. Thus, for our original sample
of 20 farms:³

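\[
\begin{aligned}
\text{Standard error of the average}
&= \frac{\text{estimated standard deviation of items in the universe}}{\text{square root of the number of cases in the sample}} \\
&= \frac{3.89 \text{ bushels}}{\sqrt{20}} \\
&= \frac{3.89 \text{ bushels}}{4.47} \\
&= 0.87 \text{ bushel}
\end{aligned}
\]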
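The two steps of this computation, adjusting the sample standard deviation and then dividing by the square root of the sample size, can be sketched as one small function. The name `standard_error_of_mean` and its return layout are assumptions of this sketch; the inputs are the values quoted in the text.

```python
import math


def standard_error_of_mean(sample_sd, n):
    """Return (estimated universe standard deviation, standard error of the mean).

    `sample_sd` is the standard deviation observed in the sample, computed
    by dividing the sum of squared deviations by n.
    """
    estimated_universe_sd = sample_sd * math.sqrt(n / (n - 1))
    return estimated_universe_sd, estimated_universe_sd / math.sqrt(n)


universe_sd, std_error = standard_error_of_mean(sample_sd=3.79, n=20)

print(f"estimated standard deviation of the universe: {universe_sd:.2f} bushels")
print(f"standard error of the average:                {std_error:.2f} bushel")
```

Algebraically the two steps collapse to dividing the sample standard deviation by the square root of n - 1, which gives the same 0.87 bushel.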

In comparison with the 15 other averages, all shown in Table 2.1, we
see that the standard deviation of all 16 averages was a trifle larger in
this case than we estimated it was likely to be: 1.00 bushel, as compared
to 0.87 bushel expected. It has already been noted that where a number
of repeated samples are actually taken, this may easily occur. Since the
observed standard deviation will vary from sample to sample, the estimate
of the standard error of the average will also vary. Even so, this estimated
“standard deviation of similar averages” is an exceedingly useful figure.
Such an estimated standard deviation for an average (or any other statistic)
is called the standard error of that average (or other statistic). It serves
as a standard measure to give some indication of how much such a sample
may give results that vary from the true facts of the universe, solely as
the result of chance fluctuations in sampling.
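The idea that the standard error predicts the scatter of repeated sample averages can be tried out by simulation. The sketch below draws many samples of 20 from a hypothetical, normally distributed universe (the mean of 39.5 and standard deviation of 3.9 bushels are assumed values for illustration, not figures from the book) and compares the observed standard deviation of the sample means with the universe standard deviation divided by the square root of 20.

```python
import math
import random

# Hypothetical universe of yields: these parameters are assumptions made
# purely for illustration.
UNIVERSE_MEAN = 39.5
UNIVERSE_SD = 3.9
SAMPLE_SIZE = 20
N_SAMPLES = 4000

rng = random.Random(0)  # fixed seed so that the run is repeatable


def sample_mean(n):
    """Mean of one random sample of n items drawn from the universe."""
    return sum(rng.gauss(UNIVERSE_MEAN, UNIVERSE_SD) for _ in range(n)) / n


means = [sample_mean(SAMPLE_SIZE) for _ in range(N_SAMPLES)]
grand_mean = sum(means) / N_SAMPLES
sd_of_means = math.sqrt(sum((m - grand_mean) ** 2 for m in means) / N_SAMPLES)
predicted = UNIVERSE_SD / math.sqrt(SAMPLE_SIZE)

print(f"observed standard deviation of sample means: {sd_of_means:.3f}")
print(f"universe SD / sqrt(n):                       {predicted:.3f}")
```

With only 16 samples, as in Table 2.1, the agreement would be looser, which is consistent with the 1.00 versus 0.87 bushel difference noted above.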

³ Here the symbol s denotes the standard deviation as before, the subscript x indicates
that it is the standard deviation of the individual items that go to make up our sample,
and the subscript M indicates that it is the standard deviation of the means which is to
be computed; thus

\(\bar{s}_x\) = standard deviation of the items in the universe, estimated by equation (2.1), (2.2),
or (2.3)

\(s_M\) = estimated standard deviation of the group of averages if similar samples were
repeated = standard error of the mean of X

The standard error of the mean is then given by the formula

\[ s_M = \frac{\bar{s}_x}{\sqrt{n}} \]

Here, just as in the previous formulas, n stands for the number of items in the original
sample, the same items as those from which \(\bar{s}_x\) was computed.




Confidence Intervals for Large Samples. It has been shown that the
means of successive random samples are distributed in a manner very
close to that of a normal curve, even if the distribution of the original
observations is far from normal.⁴ Where the sample size is large (30
observations or more) and is drawn under conditions of simple sampling,
the interval \(M \pm s_M\) will include the true mean in 68 per cent of all such
samples. The interval \(M \pm 2s_M\) will include the true mean in 95 per cent
of such samples, and the interval \(M \pm 3s_M\) in 99.7 per cent. These
ranges are called confidence intervals, and indicate for any given sample
about how much confidence we can place in the average (or other statistic)
derived from that sample as an approximation to the universe average
(or other parameter).
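The three percentages quoted above follow from the normal curve and can be recomputed from the error function. The sketch below does so and, purely to show the arithmetic of the intervals, applies them to the mean of 40.0 bushels and standard error of 0.87 bushel found earlier. That sample has only 20 observations, so strictly the large-sample percentages do not apply to it. The function name `normal_coverage` is an assumption of this sketch.

```python
import math


def normal_coverage(k):
    """Share of a normal distribution within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


MEAN = 40.0       # bushels, sample average from Chapter 1
STD_ERROR = 0.87  # bushel, standard error computed above

for k in (1, 2, 3):
    low = MEAN - k * STD_ERROR
    high = MEAN + k * STD_ERROR
    print(f"M ± {k} s_M: {low:.2f} to {high:.2f} bushels, "
          f"normal coverage {normal_coverage(k):.1%}")
```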

Confidence Intervals for Small Samples. When the sample size is 
smaller than 30 observations, however, the distribution of averages from 
successive samples is somewhat different from the normal curve. The 
proportion of samples in which a stated confidence interval, such as 
$M \pm 2s_M$, will include the true mean is smaller and the proportion of
sample means departing from the true mean by more than $\pm 2s_M$ is larger.
The smaller the number of observations, the more serious the difference. 
The exact distributions of averages (and other statistics) from small 
samples have been worked out and tabulated for different sample sizes. 

This correction shows, for specified confidence limits around the statistic
(½, 1, 2, etc., times the standard error computed from the sample),
what the probabilities are that those limits will include the true value
for the universe. These proportions are given in Table 2.3 and in Figure
2.1.⁵

Table 2.3 shows the proportion of samples of each given size which 
yield confidence intervals of each type that actually include the parameter 
for which they were constructed. Thus, if the sample is large, and we 
state that the true average lies within one standard error of the computed 
average, we should probably be right for 7 out of 10 of such statements. 

⁴ M. G. Kendall, The Advanced Theory of Statistics, Vol. I, p. 180, 1949.

⁵ Table 2.3 applies as stated only in the case of statistics such as the arithmetic average,
which are computed from the original data by the determination of a single constant.
Where the computation of the statistic involves simultaneously determining two constants
from the original data, n − 1 should be used for the number of observations in the
sample. This applies to the coefficient of regression. Where the computation of the
statistic involves simultaneously determining a large number of constants, say j in
number, from the original data, then (n − j + 1) should be used for the “number of
observations” in Table 2.3 or Figure 2.1. Thus, for a coefficient of partial regression,
$b_{12.345}$, obtained from a sample of 20 observations, 5 constants are involved, so 16 would
be used as the number of observations in using Table 2.3 to judge the reliability of the
computed value. (Subsequent chapters will explain the meaning of the new coefficients
mentioned here.)




(The exact proportion expected is 683 out of 1,000.) If there were 20
observations in the sample, and we made the same statement, we should
be right 67 times out of 100. But for samples with only 2 observations,
such a statement would be right only 50 times out of 100 on the average.

Table 2.3

Proportion of Samples of Each Given Size that Yield
Specified Confidence Intervals that Actually Include
the Parameter*

Confidence        Probability (P) for Samples with n Equal to
Intervals         2       4       6       10      16      20      30 or more

M ± 0             0       0       0       0       0       0       0
M ± 0.5 s_M       0.295   0.349   0.362   0.371   0.376   0.377   0.383
M ± 1.0 s_M       0.500   0.609   0.637   0.657   0.667   0.670   0.683
M ± 1.5 s_M       0.626   0.769   0.806   0.832   0.846   0.850   0.866
M ± 2.0 s_M       0.705   0.861   0.898   0.923   0.936   0.940   0.954
M ± 2.5 s_M       0.753   0.912   0.946   0.966   0.975   0.978   0.988
M ± 3.0 s_M       0.795   0.942   0.970   0.985   0.991   0.993   0.997
M ± 3.5 s_M       0.823   0.961   0.983   0.993   0.997   0.998   0.9995
M ± 4.0 s_M       0.844   0.972   0.990   0.997   0.999   0.999

* Based on article by “Student,” New tables for testing the significance of observations,
Metron, Vol. V, No. 3, pp. 105-120, 1923.


The estimated standard error of 0.87 bushel from our single sample of
20 cases with an average of 40.0 bushels would therefore tell us that
67 per cent of such samples would have averages which fell within an
interval of ±0.87 bushel of the true mean. If our sample was a true
random sample, we should then have 2 chances out of 3 of being right if we
assumed that the confidence interval of 39.13 to 40.87 bushels included
the true average yield for all the farms in the county.
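The arithmetic behind these figures can be sketched as follows, using the estimated standard deviation of 3.89 bushels that the text gives for this sample of 20 cases. The function and variable names are illustrative only.

```python
import math

def standard_error_of_mean(s_x: float, n: int) -> float:
    """Equation (2.4): standard error of the mean, s_M = s_x / sqrt(n)."""
    return s_x / math.sqrt(n)

s_x = 3.89          # estimated standard deviation of yields, bushels
n = 20              # number of farms in the sample
mean_yield = 40.0   # sample average, bushels

s_m = standard_error_of_mean(s_x, n)
low, high = mean_yield - s_m, mean_yield + s_m
print(f"s_M = {s_m:.2f} bushel; interval {low:.2f} to {high:.2f} bushels")
```

Rounded to two decimals, this gives the standard error of 0.87 bushel and the interval of 39.13 to 40.87 bushels stated above.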
It is evident from Table 2.3 that very small samples have much less
reliability than samples of even moderate size. Thus, a sample of 30
cases has a confidence interval of $\pm 2s_M$ for P = 0.95, but for a sample
of 10 cases, we see from Table 2.3 (or Figure 2.1) that we have to take a
confidence interval of $\pm 2.25s_M$ to have the same probability; for a sample
of 6 cases, $\pm 2.6s_M$; and for a sample of 4 cases, $\pm 3.2s_M$. (The selection
of 30 cases as the dividing line between “small” and “large” samples is
somewhat arbitrary, but the distribution of averages from samples of 30







Fig. 2.1. The proportion of random samples in which the interval of the observed
mean plus and minus the stated multiple of the standard error computed from the 
sample, will include the true mean, for samples of stated numbers of observations. 
(To apply to coefficients of regression, see footnote 6.) 

cases is so close to the normal distribution which underlies the last column 
of Table 2.3 that the differences are generally disregarded.) 
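The entries of Table 2.3 and the widened multiples just quoted can be approximately reproduced. The sketch below assumes a normally distributed universe and Student's t distribution with n − 1 degrees of freedom; the numerical integration and bisection are our own illustrative choices, not the method used to build the published table.

```python
import math

def t_pdf(t: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def t_coverage(k: float, n: int, steps: int = 2000) -> float:
    """Proportion of samples of size n for which M ± k·s_M includes the
    true mean: P(|t| <= k) with n - 1 degrees of freedom (Simpson's rule)."""
    if k == 0:
        return 0.0
    df = n - 1
    h = k / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(k, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 2 * total * h / 3  # doubled, since the density is symmetric

def multiplier_for(p: float, n: int) -> float:
    """Multiple of s_M needed to reach probability p, found by bisection."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_coverage(mid, n) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

for n in (4, 6, 10):
    print(f"n = {n:2d}: P = 0.95 requires about ±{multiplier_for(0.95, n):.2f} s_M")
```

For samples of 10, 6, and 4 cases the multiples come out near 2.26, 2.57, and 3.18, which round to the values read from the table.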

From equation (2.4) it is clear that our estimates of $s_M$ would vary from
sample to sample, as each sample would give us a different value for $s_x$.
The standard error of the standard error, stated in relative terms, depends
solely upon the number of cases in the sample. It is computed as follows:⁶

⁶ Using $\sigma_s/\sigma_x$ to represent the standard error of the estimated standard error, we can
define it by the equation

$$\frac{\sigma_s}{\sigma_x} = \frac{1}{\sqrt{2(n-1)}} \tag{2.5}$$




Relative standard error of the standard error

    = 1 / square root of [two times (number of cases in sample − 1)]

For our sample of 20 cases,

$$\frac{\sigma_s}{\sigma_x} = \frac{1}{\sqrt{2(20-1)}} = \frac{1}{\sqrt{38}} = 0.162$$

With very small samples, even the estimate of the standard error of the
average is subject to a wide zone of uncertainty. With 4 cases, its own
standard error is 41 per cent of the value computed.
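Equation (2.5) is easy to evaluate for any sample size. A brief sketch (the function name is ours):

```python
import math

def relative_se_of_se(n: int) -> float:
    """Equation (2.5): relative standard error of the estimated
    standard error, 1 / sqrt(2(n - 1))."""
    return 1.0 / math.sqrt(2 * (n - 1))

for n in (4, 20, 100):
    print(f"n = {n:3d}: {100 * relative_se_of_se(n):.1f} per cent")
```

For 20 cases this gives 0.162, and for 4 cases about 0.41, matching the figures in the text.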

Meaning and Use of the Standard Error 

It is good statistical practice, whenever an average is cited, to give with
that average its estimated standard error, so that the reader will know
about how significant that average is, and will not be led into using it
to make comparisons or to draw conclusions that are not justified by the
number of observations which are summed up in that average. This
may be done either by showing the average followed by ± its own standard
error, or with its standard error shown below it in parentheses, and, where
the sample is small, by stating the number of reports. Thus, in the case
we have been considering, with the single sample showing an average of
40.00 bushels with a standard error of 0.87 bushel, and with only 20 cases
in the sample, the correct statement is to say “the average yield has been
shown by the sample to be 40.0 ± 0.87 bushels (20 cases).” If a similar
sample from a different area has shown the average yield to be 38 ± 2.0
bushels (20 cases), the reader would know that there was a fair chance
that the true average yield in the second area was really as large as in
the first area, in spite of the difference in the two sample averages.
The greatest value of the standard error does not lie in merely indicating
how near the sample value may come to the true value for two samples
out of three. Mathematicians have determined for large samples that about 19
out of 20 (95.45 per cent) of the samples will give averages which fall
within two standard deviations of the mean, 369 out of 370 (99.73 per cent)
will usually fall within three standard deviations of the mean, and all
but one case out of 16,667 samples (99.994 per cent) will usually fall
within four standard deviations of the mean.
As shown in Table 2.3, when there are less than 30 observations in the
sample, the tendency of the computed standard error to be misleading




is greater for high odds than it is for lower odds. Thus, with samples of 
20 cases, 6 samples out of 100 will give averages differing from the true 
average by more than twice the computed standard error, and 7 
samples out of 1,000 will miss the true average by more than three standard 
errors. This last is three times the proportion of such failures which 
would occur in the long run with samples of over 30 observations. With 
very small samples, the failures for high odds occur even more frequently. 
Thus, for samples with only 4 observations, 14 samples out of 100 will 
differ from the true mean by twice the computed standard error, and 
about 6 out of 100 will differ by three times the standard error, on the
average. As already explained, it is therefore necessary to take wider 
confidence intervals, in multiples of the standard error, to attain even 
approximately the same probabilities with smaller samples. 

Interpreting the Standard Error in the Illustrative Problem. We 
can interpret the statement that the average yield in the area studied was 
40 ± 0.87 bushels in any of the following ways : 

1. If we state that the true mean lies within one standard error of the
observed mean (between 39.13 and 40.87 bushels, in this case) each time 
we use a sample of this size, we shall be wrong in our statement 1 time 
out of 3, on the average. 

2. If we state that the true mean lies within two standard errors of the 
observed mean (between 38.26 and 41.74 bushels) each time we use a 
sample of this size, we shall be wrong in our statement 1 time out of 17, 
on the average. 

3. If we state that the true mean lies within three standard errors of the 
observed mean (between 37.39 and 42.61 bushels) each time we use a 
sample of this size, we shall be wrong in our statement 1 time out of 135, 
on the average. 

4. If we state that the true mean lies within four standard errors of the 
observed mean (between 36.52 and 43.48 bushels) each time we use a 
sample of this size, we shall be wrong in our statement only 1 time out 
of 1,250, on the average.⁷
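The four statements above describe long-run frequencies, so they can be checked roughly by simulation. The sketch below is an illustration under stated assumptions: it draws repeated samples of 20 from a hypothetical normal universe with a true mean of 40.0 and a standard deviation of 3.89 bushels, estimates the standard deviation with a divisor of n − 1, and counts how often the true mean falls outside the stated interval.

```python
import math
import random

def miss_rates(multiples, n=20, trials=20000, true_mean=40.0, true_sd=3.89, seed=1):
    """Fraction of simulated samples whose interval M ± k·s_M fails to
    include the true mean, for each multiple k."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    misses = {k: 0 for k in multiples}
    for _ in range(trials):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        mean = sum(sample) / n
        # Sample standard deviation with n - 1 in the divisor (an assumption)
        s_x = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
        s_m = s_x / math.sqrt(n)
        error = abs(mean - true_mean)
        for k in multiples:
            if error > k * s_m:
                misses[k] += 1
    return {k: count / trials for k, count in misses.items()}

rates = miss_rates((1, 2, 3))
for k, rate in rates.items():
    print(f"wrong with ±{k} standard errors: about {rate:.3f} of the time")
```

With enough trials the miss rates settle near one in 3, one in 17, and one in 135, in line with statements 1 to 3. The rate for four standard errors is too small to estimate well with this number of trials.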

Comparing these conclusions with the 16 samples shown in Tables 2.1 
and 2.2, we see that 2 of those samples did fall outside the limits given 
by twice the estimated standard error. If we had been so unlucky as to 
have got the worst one of these as our single sample, instead of the one 
we actually did get, then we should not have hit the average even if we 
had used a range of twice the computed standard error as that within 
which we expected the true average to lie. On the other hand, every one of 

⁷ These odds are calculated from tables carried to more decimal places than those
shown in Table 2.3. 




the averages fell within the range covered by three times the standard
error. Even if, in picking our single sample, we had been unfortunate
enough to draw the poorest one of the lot (the one which gave an average
yield of 37.5 bushels) and had used a range of three times the standard
error, we should have been correct in our statement as to the range within
which we expected the true average to lie. Then we should have concluded
that the true mean lay somewhere between 34.3 and 40.7 bushels, which
would have been wide enough to include the real mean. Of course, if
we had taken four times the standard error, we should have been almost
absolutely certain of including the true mean in the stated range, with
only one chance in over 1,000 of being wrong.

For any given size of sample, each of the confidence intervals cited
corresponds to a particular probability, P. However, as P varies with
sample size, many statisticians first choose the value of P that they are
willing to “act on” in a given investigation and then select the corresponding
confidence interval, which, in general, will not be an exact
multiple of the standard error. Probabilities of 0.95 and 0.99 are most
commonly used, their use implying that the investigator is willing to
accept the one chance in 20, or the one chance in 100, that the universe
value lies outside the specified range. Logically, the “probability of
error” (1 − P) one is willing to accept should vary with the importance
of the actions that might be based on the results of the investigation.
For the general run of statistical problems, and with fair-sized samples,
it would seem safe to regard three times the standard error as about the
largest extent to which the conclusions might be out solely because of
the chances of getting an unusual sample in random sampling, the
probability of error being held between 0.01 and 0.001.

In view of the possibility of the standard error itself being in error,
however, the number of observations should always be stated, as well as
the standard error of the statistic, particularly where the sample is small.

Bias in Sampling. The figure as to standard error tells nothing at all
of how much error there may be because of bias in sampling. Thus, if
in taking our sample of 20 farms, we had visited only the largest farms
with the most prosperous-looking buildings, we should be very likely to
get a sample which was not representative of all the farmers in the county,
but simply of the better ones, and so might get an average yield considerably
above the true average for the county. Even if we selected our
farmers only to the extent of including those who were most willing to
give us the figures we wanted, we might have a badly biased sample, as
usually the best farmers and the most intelligent ones are most willing
to answer such questions. The only theoretically sound way to avoid
bias is to define the universe of interest very carefully in advance of the




field investigation and then draw our sample in such a way that every 
item in the universe actually has an equal chance of being chosen in the 
sample. (In stratified random sampling the universe may be divided into
two or more parts, but within each part each item should still have an
equal probability of selection.) In the corn yield example, this might
require that we have a list of all farms in the county on which corn was
grown, or that we number each square mile of land in the county, draw a
random sample of these numbers, and measure the corn yield on every
farm lying wholly or partly within each of the square miles appearing in
the sample. If the expense of obtaining a complete list of farms or drawing
an area sample seemed too great for the purpose at hand, we might select
our 20 farms in a less formal way, still trying consciously to have them
represent the different yield levels around the county in about the true
proportions. However, in this case, we must depend largely on common
sense and on other knowledge of the situation we are studying, and not on 
statistical computations, to tell us whether or not our sample is really 
representative of the universe we want to study. Thus, we might compare 
the average size or value of the farms in our sample with the averages for 
all the farms in the county, as shown by the census reports, to see whether 
they were representative or not in these respects. All that the computed 
standard error can tell us is about how closely the sample statistic is 
likely to approach the average (or other characteristic) of the group it does 
actually represent — whether that group is the one we meant it to represent 
or only a part of that group. This caution must always be kept in mind
in using samples: computed standard errors tell us how far our results
may be off solely because of the chance of getting a poor sample with a 
limited number of cases ; but they do not tell us how far we may be off 
because of a biased sample, which is not a fair selection from the universe 
we wish to study. 
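Drawing a sample in which every item in the universe has an equal chance of selection can be sketched as below. The list of 500 numbered farms is purely hypothetical and stands in for the complete list of corn-growing farms described above; the seed is fixed only so the draw can be repeated.

```python
import random

def draw_simple_random_sample(universe, k, seed=None):
    """Select k distinct items so that every item in the universe
    has the same chance of being chosen."""
    rng = random.Random(seed)
    return rng.sample(list(universe), k)

farm_ids = range(1, 501)  # hypothetical: farms numbered 1 to 500
sample = draw_simple_random_sample(farm_ids, 20, seed=42)
print(sorted(sample))
```

Such a draw guards only against bias in selection. It does nothing to ensure that the list itself covers the universe we wish to study.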

Deciding on the Size of Sample Necessary to Obtain a Stated 
Reliability. One other application of the standard-error formula remains 
to be mentioned. The way in which this formula can be used to estimate 
the reliability of the average from a given sample, when the number of 
cases is known, has already been explained. The same formula can be used 
to determine how large a sample would have to be taken in order to 
secure results within any previously assigned limits of accuracy. 

Thus it has already been shown that the records from 20 farms could 
be used to say that the true average yield lay somewhere between 37.39 
and 42.61 bushels, with about one chance in 135 of that statement’s 
being wrong. How many farms would one have to visit to state the same 
average yield to within one bushel, with the same chance of the statement’s 
being wrong? The same formula which was used to determine the 




standard error of the average can be turned around to answer this question
also.

If we know that we want to get an average reliable to within one bushel,
for a range of three times its standard error, then we know that the standard
error of that average would have to be only one-third of a bushel. We
may also assume that when we take our larger sample, the standard deviation
of the yields on the individual farms will be found to be not very
different from what it was in our sample of 20 cases, and so use the same
standard deviation as we did before.

Taking the relation which was used in computing the standard error
before, we have

$$s_M = \frac{s_x}{\sqrt{n}}$$

In the new case we have the required standard error given, ⅓ bushel;
we are assuming that the estimated standard deviation for the universe
from our larger sample will be 3.89 bushels, just as it was from our sample
of 20 cases. Substituting these values in our equation, and using n′
to represent the number of cases required in the new sample, we then have

$$\tfrac{1}{3}\ \text{bushel} = \frac{3.89\ \text{bushels}}{\sqrt{n'}}$$

When the terms are shifted around, this becomes

$$\sqrt{n'} = 3 \times 3.89 = 11.67, \qquad n' = (11.67)^2 \approx 136$$
We therefore conclude that if a sample of 136 reports were obtained,
we should probably get an average yield which would not differ from the
true average yield for all the farms by more than one bushel in more than
one sample out of several hundreds of such samples. If any other limit
of error was set, we could similarly determine how many reports would
probably be necessary to satisfy that limit.⁸
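The same calculation can be written as a short function for other limits of error. This is a minimal sketch with names of our own choosing; it assumes, as the text does, that the standard deviation found in the first sample will hold for the larger one.

```python
def required_sample_size(s_x: float, target_se: float) -> float:
    """Solve s_M = s_x / sqrt(n) for n, given the standard error wanted."""
    return (s_x / target_se) ** 2

s_x = 3.89             # bushels, from the sample of 20 cases
target_se = 1.0 / 3.0  # one bushel spread over three standard errors

n_prime = required_sample_size(s_x, target_se)
print(f"n' = {n_prime:.1f}, or about {round(n_prime)} reports")
```

The result is a little over 136, which the text rounds to 136 reports.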

⁸ Because the probability (P) associated with a confidence interval of three standard
errors is larger for a sample of 136 cases than for a sample of 20 cases, we could hold
the probability of being wrong by more than one bushel at one chance in 135 (P = 0.993)
by choosing somewhat fewer than 136 cases. Thus, from Table 2.3 it appears that in
large samples P = 0.993 for a confidence interval of about $M \pm 2.7s_M$. This would
lead to an acceptable $s_M$ of about 0.37 bushel and a necessary sample size of about
110 cases. This refinement is superfluous in most practical situations; if the smaller
sample size is 30 or more and if the acceptable probability of error is at least 0.01, the
method of calculation shown in the text is sufficiently exact.




Standard Errors for Other Measures. This whole discussion has 
been in terms of determining how closely it was possible to approximate 
the true average from the average shown by a sample. In exactly the same
way standard-error formulas have been worked out indicating how 
closely it is possible to approximate the true values of other parameters 
(such as standard deviations, for example) from the values for those 
measures determined from a sample. These are shown and interpreted 
in much the same way as are the standard errors of averages; they will 
be referred to in subsequent chapters. 

Universes, Past and Present 

Any statistical measurement relates to something that is already past 
by the time the measurement can be analyzed. Thus our records of the 
yield of corn obtained must relate to some crop that has already been
harvested. Yields for a crop still growing could only be forecasts, and could 
never be precisely accurate until the crop was harvested and was weighed 
or measured. Yet human beings cannot live in the past. If we are planning 
an agricultural control program, for example, and wish to estimate how 
many acres will produce a given total bushelage, we shall be dealing with
future years. We can do nothing to change the past. Only the future 
can be affected by our actions. When we take the average yield for a past 
year as our “universe” to be studied, what we are really interested in 
knowing is usually something about the yield most likely to be secured 
in one or in a series of years in the future. 

Analysis of what has happened in a succession of years in the past 
may help us to make a better estimate of the future. Such analysis may 
show a steady upward trend, or a variation from year to year with rainfall, 
or other variations whose cause we do not know. But before we can 
project the past trends into the future, we must try to understand what 
caused them, and judge whether those causes will continue to operate. 
These judgments are not a matter of statistical analysis as such but must 
be based upon scientific and technological study of all the forces at work. 
Thus a steady upward trend in cotton yields might reflect a rising price 
of cotton in the period studied, and a resulting increase in the quantities 
of fertilizer applied per acre. But equally well it might reflect a steady 
decrease in the total acreage (due to crop control or other causes) and 
a concentration of the remaining acreage on the better lands. Or it might 
reflect the gradual adoption of improved strains. A forecast of whether 
the upward trend would continue into the future would be materially 
different in the three cases. Besides the statistical facts, it would involve 
study of what had been happening in cotton production in the area, and 




non-statistical judgments as to whether the increase in price or the limitation
of acreage or the improvement in seed was likely to continue.

Whether we are dealing with the statistical characteristics of people
or of crops or of prices or of atoms, the real universe for which we wish
to estimate is the universe of future events. Our ability to forecast those
events will differ widely from field to field. Presumably the characteristics
of atoms or of chemical compounds will be less subject to change than
will those of crops or prices. In each case, however, the statistical information
gained from the study of past samples must be tempered by other
knowledge of the situation, based on study and analysis which may be
quite non-statistical in nature. When we move from the facts of the past
to forecast the unknown universe of the future, it is not the statistics but
the statistician who is on trial. Unless he mixes an ample measure of
anthropology or agronomy or economics or other appropriate scientific
information with his statistics, plus a liberal dash of common sense,
he may find his analysis of past events a detriment, rather than an aid,
in judging the future. Some of the issues introduced here are discussed
further in Chapters 17, 20, and 26.

Summary 

This chapter considers the question of how far statistical results derived
from a selected “sample” drawn from a universe can be used to reach
general conclusions as to the facts of the entire universe.

The confidence which can be placed in any statistic computed from
a sample, say an average, depends upon how closely that average is
likely to come to the true average of the whole universe. One way of
determining that would be to collect additional samples, each of the same
size. From the way the averages from each of these different samples
varied, one could judge how near the average from any one sample was
likely to come to the true average. For samples which meet the conditions
of simple sampling, another much more rapid way is to estimate the
standard error of the average, which gives a basis for estimating how much
confidence can be placed in the observed average. With large samples, the
true average will probably be within twice the standard error from the
observed average for 21 samples out of 22, and within three times the
standard error 369 times out of 370. Where the number of observations
in the sample is less than 30, the possibility of error is larger, as is indicated
by Table 2.3.

The same formula can be used to estimate how large a sample must
be taken to attain any desired degree of confidence in the final average.

The estimated standard error does not take into account bias in selecting




the sample, but only indicates the limits of confidence that can be placed 
in the result even when an “honest” random sample is obtained. 

After the values in the universe have been estimated from the facts 
shown by the sample, the statistician must still remember that that 
universe is a past universe. In applying that knowledge to problems of 
future action, he must give due allowance to the fact that the as yet 
unborn universe of the future may never be identical with the past and 
dead universe from which his sample was obtained. 



CHAPTER 3 


The relation between two variables, 
and the idea of function 


Relations are the fundamental stuff out of which all science is built.
To say that a given piece of metal weighs so many pounds is to state a
relationship. The weight simply means that there is a certain relationship
between the pull of gravity on that piece of metal and the pull on another
piece which has been named the “pound.” We can tell what our “pound”
is only by defining it in terms of still other units, or by comparing it to
a master lump of metal carefully sheltered in the Bureau of Standards.
If the pull is twice as great on the given piece of metal as it is on the standard
pound, then we say that the lump weighs 2 pounds. If, further, we say
it weighs 2 pounds per cubic inch, that is stating a composite relationship,
involving at the same time the arbitrary units which we use to measure
extent or distance in space and the units for measuring the gravitational
force or attracting power of the earth.

Relations between Variables. Besides these very simple relationships
which are implicit in all our statements of numerical description (weight,
length, temperature, size, age, and so on), there are more complicated
relationships where two or more variables are concerned. A variable is
any measurable characteristic which can assume varying or different
values in successive individual cases. The yield of corn on different farms
is a variable, since it may differ widely from farm to farm. So is the length
of time which a falling body takes to reach the earth, or the quantity of
sugar that can be dissolved in a glass of water, or the distance it takes for
an automobile to stop after the brakes are applied, or the quantity of
milk that one cow will produce in a year, or the tensile strength of a
piece of metal, or the length of time it takes a person to memorize a
quotation. In contrast to these variables there are other numerical values
called constants, because they never change. Thus one foot always




contains 12 inches, one dollar always is equal to 100 cents; and a stone 
always falls 16 feet in the first second (under certain specified conditions). 
Science, of any sort, ultimately deals with the relation between variable 
factors and with the determination, where possible, of the constants 
which describe exactly what those relationships are. 

The variables which have been mentioned may be used to illustrate the 
way in which changes in one variable can be related to changes in another. 
Thus the length of time which a falling body takes to reach the earth 
varies with (that is, is related to) the distance through which the body
has to fall. The quantity of sugar which can be dissolved in a glass of 
water varies both with the size of the glass and the temperature of the 
water. The distance it takes for an automobile to stop after the brakes 
are applied varies with the speed at which the car is traveling when the 
brakes are applied, the area of braking surface on the drums, the area of 
tire surface on the road, how tightly the brakes are applied, how much the 
car weighs, the kind of road, and so on. 

Then when we come to variables like the production of milk, or the 
time required to memorize a quotation, we find the situation still more 
complicated. How much milk a cow will produce varies with her age, 
breed, inherent ability, and the richness of the milk, and with the kind, 
quality, amount, and composition of the feed she receives, the way 
she is stabled and cared for, and many other similar factors. The time 
it takes to memorize a quotation may be affected by its length, the subject’s 
age, sex, training, fatigue or freshness, his familiarity with material 
discussed, and his interest in the topic. The strength of a piece of metal 
will be affected by its size, shape, composition, heat treatment, temperature, 
and so on. 

Yet it is precisely with relations between complex variables that many 
statistical studies must deal. Science deals in large part with the 
relations between variables, and with the parameters that express these 
relations. 

The statistical methods which may be used to handle such problems 
can best be understood if presented first for the simplest cases, and then 
expanded to cover the more complicated ones. Suppose a physicist 
made some experiments to determine the relation between the distance 
a body has to fall and the length of time it takes, by dropping a marble 
different distances and measuring the time it takes to reach the ground, 
and obtained the results shown in Table 3.1. 

Looking over these figures we see that there is some sort of general 
relation between the two columns. As the distance increases, the time 
increases also. But that is not uniformly true. In one case the distance 
increased without there being any increase in the recorded time : in some 




other cases the recorded time was not the same even though the distance
was unchanged.

Table 3.1

Relation between Distance a Marble Drops and Time
It Takes to Fall

Distance Traveled   Time Elapsed   Distance Traveled   Time Elapsed
      Feet            Seconds            Feet            Seconds

        5               0.6               20               1.1
        5               0.5               20               1.1
        5               0.6               20               1.2
       10               0.9               20               1.1
       10               0.8               25               1.2
       10               0.7               25               1.3
       15               1.0               25               1.2
       15               0.9               25               1.3
       15               1.0
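The general pattern in Table 3.1 is easier to see when the recorded times are grouped by distance. The following sketch uses the 17 observations from the table; the grouping and the names used are our own illustration.

```python
from collections import defaultdict

# (distance in feet, time in seconds), as recorded in Table 3.1
observations = [
    (5, 0.6), (5, 0.5), (5, 0.6),
    (10, 0.9), (10, 0.8), (10, 0.7),
    (15, 1.0), (15, 0.9), (15, 1.0),
    (20, 1.1), (20, 1.1), (20, 1.2), (20, 1.1),
    (25, 1.2), (25, 1.3), (25, 1.2), (25, 1.3),
]

times_by_distance = defaultdict(list)
for distance, seconds in observations:
    times_by_distance[distance].append(seconds)

# For each distance: shortest, longest, and average recorded time
summary = {
    d: (min(ts), max(ts), sum(ts) / len(ts))
    for d, ts in sorted(times_by_distance.items())
}
for d, (shortest, longest, average) in summary.items():
    print(f"{d:2d} feet: {shortest:.1f} to {longest:.1f} seconds, average {average:.2f}")
```

The averages rise steadily with distance, yet the ranges overlap: some drops of 20 feet and of 25 feet were both recorded at 1.2 seconds.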




Graphic Representation of Relation between Two Variables. We 
can get a better idea of what the relation is if we “plot” it on cross-section
paper so that we can see graphically just how the time does vary with 



[Figure 3.1: dot chart with distance fallen (feet) on the horizontal axis and time elapsed (seconds) on the vertical axis, showing a labeled dot for 5 feet and 0.6 second and a labeled dot for 25 feet and 1.3 seconds.]

Fig. 3.1. Method of constructing a dot chart. Time elapsed is the
dependent variable, and distance is the independent variable.


the distance. Figure 3.1 illustrates the way this is usually done. The
units of one variable, in this case the distance to be traversed, are measured
off from the left, starting with zero in the lower left-hand corner and
counting over toward the right. The units of the other variable, in this




case the time elapsed, are measured off from the bottom, starting with 
zero and counting up toward the top. If negative values are present, then 
the counting is started with the largest negative value, decreasing from 
left to right or from bottom to top, until zero is reached and the positive 
values begin to appear. 

Where one variable may be regarded as the cause and the other variable 
as the result, it is customary to put the causal variable along the bottom. 
In this case with the measurements made by dropping the marble known 
distances and measuring the time elapsed, it may be said that the differences 
in distance traversed cause the differences in time elapsed. Distance, 
therefore, is measured in the horizontal direction, and time in the vertical. 
There is no particular reason for plotting data just this way except that
this is the customary way of doing so. Some relations of this sort can be
reversed, so that either may be regarded as cause and either as effect.¹

Having laid off the chart in the way indicated, we next “plot” the 
individual observations. The way this is done is illustrated in Figure 3.1. 
The first observation was that it took 0.6 second for the marble to fall 
5 feet. This is indicated on the chart by counting over to the 5-foot line 
from the left of the chart, and then counting up along that line until 
0.6 second is reached. A dot is placed on the chart at that point. As 
indicated, this dot is at the intersection of the line starting from the 
“0.6 second” at the left of the chart and extending parallel to the “0-second” 
line, with the other line starting from “5 feet” at the bottom of the chart 
and extending parallel to the “0-foot” line. Similarly, the last observation, 
25 feet in 1.3 seconds, is indicated by a dot where the horizontal line 
representing 1.3 seconds crosses the vertical line representing 25 feet. 

Entering a dot for each individual observation in the same way, we
get the chart shown in Figure 3.2. This figure now gives a visual
representation of the way in which the length of time changes as the distance
traversed changes. Such a chart is known as a “dot chart” or a “scatter
diagram.”

But even this figure does not show the exact relation between the distance 
and the time. Both the first and the second trials were for exactly the 
same distance, yet the time was slightly different. Obviously that differ- 
ence in time could not have been due to the difference in distance between 
the two, because there was no difference. The investigator must therefore 
assume that some outside cause, perhaps the accuracy with which the time 
was measured, may have been responsible for these slight differences. 
It will be noted, too, that when the different observations are plotted as 
in Figure 3.2, they come close to all lying along a continuous curve, 
but do not fall exactly on it. If we are willing to assume that all the 

¹ For a more extended discussion of this point, see pp. 47 and 48.




differences between the different observations at the same point along the
curve are due solely to extraneous factors, we can estimate the true
effect of the distance by itself by averaging together the several
observations as to time taken for each of the several tests for the same length of
fall. A continuous curve drawn through these averages would then
indicate the way in which the duration of fall varied with the distance,
on the average of the cases studied. Although it might not hold true for
any one individual case, as we have just seen, still it does indicate about
what the time will be for any given distance. For practical purposes we
may say that under given conditions the time a body takes to fall is
determined by the distance which it has to fall, as shown by the curve.

Fig. 3.2. Relation of distance a marble falls to time elapsed in falling,
as shown by individual observations and curve of average time.
[Horizontal axis: distance fallen (feet), 0 to 30. Legend: individual
observations; average for group.]

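The averaging of repeated trials described above can be sketched in a few lines of Python. This is only an illustration, not part of the original text; the list `observations` is transcribed from Table 3.1, and the function name `average_time_by_distance` is a hypothetical one chosen for the example.

```python
from collections import defaultdict

# Observations transcribed from Table 3.1: (distance in feet, time in seconds).
observations = [
    (5, 0.6), (5, 0.5), (5, 0.6),
    (10, 0.9), (10, 0.8), (10, 0.7),
    (15, 1.0), (15, 0.9), (15, 1.0),
    (20, 1.1), (20, 1.1), (20, 1.2), (20, 1.1),
    (25, 1.2), (25, 1.3), (25, 1.2), (25, 1.3),
]


def average_time_by_distance(pairs):
    """Return {distance: mean time} for the trials sharing each distance."""
    times = defaultdict(list)
    for distance, seconds in pairs:
        times[distance].append(seconds)
    return {d: sum(t) / len(t) for d, t in sorted(times.items())}


averages = average_time_by_distance(observations)
for distance, mean_time in averages.items():
    print(f"{distance:>2} feet: {mean_time:.2f} seconds")
```

Each printed line corresponds to one "average for group" point of the kind through which the curve of Figure 3.2 is drawn.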
Expressing a Functional Relation Mathematically. The relation
shown by the curve in Figure 3.2 is what mathematicians call a functional
relationship; the time it takes a body to fall is a function of the distance
which it has to traverse.² This means that for any particular distance-fallen,
there is some corresponding time-required. The term “function”
means that there is some definite relation between the two variables,
number of feet and number of seconds, but it does not tell just what
that relationship is. When, however, it is said that time is a function of
distance according to the curve shown in the figure, then the statement has
been made specific. The curve shows, for the distances shown on the

² Using Y for time and X for distance, we state this mathematically:
$Y = f(X)$




chart, how long it will take a body to fall, on the average of a series of 
trials. 

In this particular case the function is defined only by the graphic curve.
It may also be stated as a mathematical expression

$$Y = \tfrac{1}{4}\sqrt{X}$$

using X for distance in feet and Y for time in seconds. This equation
corresponds to the curve. If any value of X is substituted in it, and then
the value of Y determined, that will be the value of Y (time in seconds)
corresponding to that particular value of X (distance in feet), as shown
by the curve in Figure 3.2. This equation is therefore the equation of
the function, since this simple mathematical expression tells just as much
about the relation between the two varying quantities, time and distance,
as does the entire curve in the figure.

The way this equation can be used may be illustrated by two examples. 
Suppose a marble falls 16 feet; how long should it take to fall? The 
value of X would then be 16; substituting this value in the equation, 
we have 

$$Y = \tfrac{1}{4}\sqrt{16} = \tfrac{1}{4}(4) = 1$$

This gives a value of 1 for Y, which means that it would take about 
1 second to fall. Suppose again a bomb were dropped from an airplane 
10,000 feet above the ground. How long would it take to reach earth? 
The value of X is then 10,000; substituting this value in the equation, 
we have 

$$Y = \tfrac{1}{4}\sqrt{10{,}000} = \tfrac{1}{4}(100) = 25$$

The result means that it would take about 25 seconds for the bomb to
fall.³
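The two substitutions just worked by hand can also be checked with a short Python sketch. The function name `fall_time_seconds` is an assumption made for this illustration; the formula is the one given in the text, with air resistance ignored.

```python
import math


def fall_time_seconds(distance_feet):
    """Time of fall from Y = (1/4) * sqrt(X), with X in feet and Y in seconds."""
    return 0.25 * math.sqrt(distance_feet)


print(fall_time_seconds(16))      # marble falling 16 feet
print(fall_time_seconds(10_000))  # bomb falling 10,000 feet
```

The same function may be called with any other distance, which is the sense in which the equation goes further than the chart.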

It is evident that the equation goes much further than does the graph 
of the curve. The latter gives the relation between distance and time only 
for the distances which are shown on the chart. The equation, on the 
other hand, gives the relation for any distance whatever, no matter what 
it may be. It is possible to state this aspect of the law of gravity in an 
equation only because physicists have studied this relation in the past 

³ Outside causes, such as friction with the air, may make the time of fall slightly
different from the calculated time; therefore with so long a fall as this the time might 
differ perceptibly from the theoretical time given by the equation. This equation gives 
the time required when no influence other than gravity is taken into account. Obviously 
a marble would fall in air much faster than a feather — the resistance of the air has very 
little influence on the speed of the marble and a great deal of influence on the speed of 
the feather. In a vacuum they would fall at the same rate. 




and determined exactly how the one quantity varies with the other.
Having found that the same relation between the two variables held through
their entire range of observation, and having worked out on logical
grounds a good reason why that relation should hold, they have felt safe
in coming to the conclusion that it will continue to hold even beyond
the range of the experimental verification.⁴ Where only a graph of the
function is available, on the contrary, only the relation within the stated
range is known. The graph does not tell, of and by itself, the direction
or shape the curve would take if extended beyond the limits determined
by the experiments.

Now if instead of the relation we have just been discussing we consider
the relation between the quantity of sugar which can be dissolved in a
glassful of water and the temperature of the water, we have quite a
different problem, and yet one that is similar in many aspects. If we
start to determine it experimentally, we must first make sure that the
quantity of water with which we are working is the same in every trial;
then we must measure accurately both the temperature of the water and
the amount of sugar which could be dissolved in it. Many other similar
factors which might possibly influence the result would have to be
considered before even the exact plan of the experiment could be
drawn up.

Once the experiment had been run, the numerical results would probably
be somewhat similar in character to those in the gravity test. If the data
were plotted on a scatter diagram like Figure 3.2, it would be found that
the data fell in the general shape of a curve, but that very few of the dots
fell exactly on the curve, some lying above and some below the continuous
line which could be drawn about through the center of them. Again
we might conclude that these slight differences from exact agreement were
due to factors other than the temperature of the water (to slight
experimental errors in the quantity or temperature of the water, or to slight
errors of measurement in determining the quantity of sugar) and be
willing to conclude that the line drawn through the center of the series
of observations showed the real effect of differences in temperature on
the quantity of sugar dissolved, when extraneous influences were removed.
This again would be a functional relation. The curve would express the
relation between changes in temperature and changes in quantity of sugar,
showing for any given temperature exactly how much sugar could be
dissolved. It might then be possible to determine a type of equation which

⁴ It should be noted that for very great distances, say 10,000 miles, the formula might
need to be modified, since then the pull of the earth would be less than it is at the surface.
The equation holds true only for those distances from the earth within which its pull
is practically a constant.




would accurately specify the function by a mathematical formula, similar
to that discussed for the gravity example, if the logical type of relation
between the two variables could be worked out.⁵
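For the gravity example, where the square-root type of curve has been settled on logical grounds, one possible way of obtaining the constant from the experimental results is sketched below in Python. The least-squares criterion used here is an assumption made for illustration only; the text does not say at this point which method of calculation is intended. The data are those of Table 3.1, and the name `fit_sqrt_constant` is hypothetical.

```python
import math

# Observations transcribed from Table 3.1: (distance in feet, time in seconds).
observations = [
    (5, 0.6), (5, 0.5), (5, 0.6),
    (10, 0.9), (10, 0.8), (10, 0.7),
    (15, 1.0), (15, 0.9), (15, 1.0),
    (20, 1.1), (20, 1.1), (20, 1.2), (20, 1.1),
    (25, 1.2), (25, 1.3), (25, 1.2), (25, 1.3),
]


def fit_sqrt_constant(pairs):
    """Least-squares estimate of c in Y = c * sqrt(X), with no intercept."""
    numerator = sum(math.sqrt(x) * y for x, y in pairs)
    denominator = sum(x for x, _ in pairs)  # equals the sum of sqrt(x) squared
    return numerator / denominator


c = fit_sqrt_constant(observations)
print(f"estimated constant: {c:.3f}")
```

With these seventeen observations the estimate comes out close to the one-quarter used in the equation for the time of fall.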

Determining a Functional Relation Statistically. In the two cases 
which have been discussed the relation between the two variables was 
sufficiently close so that by taking proper experimental precautions other 
influences which might affect the result could be largely removed and a 
series of observations obtained sufficiently consistent with each other so 
that the exact nature of the relation could be readily determined. In 
many other types of relations this cannot be done so easily. It is with 
this type of relation that statistical methods really become important. 

If we were making a traffic study in a given city, for example, we might 
wish to know what would be the safe speed limits to permit on different 
streets. In that connection we might need to know in what distance an 
automobile could be stopped when traveling at different speeds, so that 
by comparing this distance with the width of the different streets and the 
length of view at intersections we could judge how fast machines might 
be able to travel without risk of collisions at street intersections. One 
way to determine what is the relation between speed and stopping distance
would be to make a number of tests in different portions of the city, 

⁵ Some logical foundation is needed before a mathematical equation to a curve can
be of any more value than merely the chart which graphs the curve. Thus in the gravity
example it is evident that the farther a body falls, the faster it falls; in every successive
instant the speed it has already attained is increased by the effect of the continued pull
which is added to it. Purely mathematical investigations of the relation between such
constantly growing magnitudes and the variable with which they grow have enabled
physicists to determine the general mathematical type to which the relation must
conform. Then, knowing the type of the curve, it is relatively easy to determine the value of
the constants (such as the “¼” of the equation $Y = \tfrac{1}{4}\sqrt{X}$) which makes the general
equation applicable to a given specific case. This is done by using experimental results,
such as those given in Table 3.1, to calculate the constants for the specific type of curve
which has been determined upon.

An algebraic equation which expresses the relation logically expected between or 
among two or more variables is sometimes called a “model” of the relationships. 
Such a model is a mathematical expression of the hypothesis according to which the 
observed data will be examined to see whether or not the facts support the hypothesis, 
and to determine the value of the statistics. Sometimes a model will consist of two or 
more equations. (Note further discussions in Chapters 24 and 26.) 

Not all functional relations can be subjected to this type of logical analysis, however, 
and it is sometimes impossible to tell what sort of equation the results should really 
follow. In that case any mathematical curve “fitted” to the data has no more special 
meaning than the graphic curve drawn through the center of the observations; both are 
merely empirical descriptions of the relations, and both are limited in their interpretation 
to the range of the particular data upon which they are based. This fact will be discussed 
more fully later on. 




taking different types of machines and different drivers chosen at random
so as to get a representative sample.⁶ Let us suppose that as the result
of such a series of tests we obtained the series of observations shown in
Table 3.2.


Table 3.2

Relation between Speed of Automobile and Distance
Required for Stopping after Signal, as Shown by 63
Observations

Speed    = speed when signal is given (miles per hour)
Distance = distance traveled after signal before stopping (feet)

Speed   Distance     Speed   Distance     Speed   Distance
   5        2          21       39          28       84
  10        8          26       39          27       57
  10       17          25       33          30       67
  10       14          24       56          16       34
   8        9          18       29          18       34
  16       19          25       59           8        8
  17       29          27       78           5        8
  12       11          25       48           5        4
   9        5          21       42          13       15
   7        6          25       56          14       14
   7        7          30       60           8       13
   9       13          29       68           9        5
   4        4          17       22          14       16
   5        8          16       14           8       11
  13       18          13       27          35       85
  15       16          12       21          40      110
  18       47          12       19          39      138
  19       30          26       41          31       77
  20       48          28       64          35      107
  21       55          29       54          22       35
  36       79          30      101          40      134

⁶ The problems involved in choosing a sample that will both properly represent the
universe, and measure the relations in the sample so as to best judge what are the true
relations in the universe, are discussed at greater length in Chapters 17, 18, and 20.




It is apparent from the table that there are great variations in the distances 
which different cars or different drivers required to stop, even when 
traveling at the same speed. This is shown even more clearly when we 
make a dot chart of the data in just the same way that was illustrated in 
Figure 3.2. The graphic comparison between speed and distance-to-stop, 
shown in Figure 3.3, reveals that there is only a general agreement between 
the different tests. There is certainly some relation between the two
variables, but it is vague and uncertain in comparison with the relatively
sharp and clear-cut relation shown in Figure 3.2.

Fig. 3.3. Relation of speed of automobile to distance it takes to stop,
as shown by individual observations. [Horizontal axis: speed of auto
(miles per hour).]

There is no difficulty in understanding why the relation is not more 
definite. The stopping distances may be influenced by many differences 
from car to car. Some cars will have brakes in adjustment and others 
brakes well worn; some cars will be nearly empty and others heavily 
loaded; some will have tires worn smooth and others new tires; some 
will have average brakes and others power brakes or exceptionally large 



ones, as on sports cars. In addition, the drivers differ. Some are experienced
drivers, some inexperienced; some strong and some unable to press the
brakes fully down; some with almost instantaneous reaction to our
signal to stop, some with faltering or lagging response; some bright and
wide awake, others tired and unobservant; some calm and steady, others
nervous and erratic. Finally, the conditions of the tests might be different:
some on concrete pavement, others on asphalt; some on up grades,
some downhill.

There are two ways by which we might go about deciding exactly what
these varying observations showed. One way would be to divide up the
data so that the effect of some of the different factors would be removed
from the results. Thus if we separated the observations into different
groups according to the make of car, and then each of these groups
according to the model or the year made, the relation between speed and
distance for any single subgroup would no longer be affected by differences
in braking equipment so far as engineering design went. Most of the
remaining factors, however, would still be present to affect the results,
so that even within each subdivision the records would still show great
diversity in the relation. Only if we continued the process of subdivision
of our sample until we got down to successive observations of a single
car operated by a single driver at the same place would we be likely to
get observations more nearly consistent with each other. Differences
in the promptness with which the driver responded to the signal, in the
preciseness with which the speed at the moment of giving the signal was
observed, and in the force with which the driver applied his brakes,
all might influence the result, so that even then the results would be less
consistent (“the curve be less definitely defined”) than in a series of
laboratory experiments where all the important outside variables could be
definitely controlled and so prevented from affecting the results obtained.

Should the entire mass of observations be analyzed as suggested, that
would give a great number of different sets of relations, each one showing
how long it took a given car to stop when driven by a given driver, when
traveling at different speeds. But this great number of different curves
might not be suitable to answer our question. They might be so different
from curve to curve that it would seem that there was no real general
relation between speed and distance. A new car with power brakes,
driven by an experienced driver, might stop in its own length at the same
speed at which an old car with brakes badly worn, driven by an inexpert
driver, might require a far greater distance. Obviously neither one of these
extremes would be typical of the general relation, but what would be
typical? Even the less extreme cases might show great variations among
themselves, so that it would be almost impossible to pick from the great



diversity of curves one or a few that would serve as a basis of judgment 
for our problem. 

A second way of going about it would be to try to determine some sort 
of average relation between speed and distance. In that case we should 
admit that there were great differences from the average in individual 
cases, yet should feel that the average would serve as a general indication 
of the relation, even though we were aware it would not be true in every, 
or perhaps even in any, individual case. If we knew nothing about a car 
except the speed at which it was moving, that average relation, however, 
would serve to give us the best guess we could make as to how far it would 
go before stopping. 

Where the relation between two variables is clear and reasonably 
sharply defined, as in the experimental cases discussed, it is not difficult 
to determine the average relationship, since the relation for individual 
cases and the average relation for all cases are nearly identical. Where
the relation is not so well defined, however, and especially where many 
other relations are involved in addition to the particular one which is 
being studied, it is by no means easy to determine exactly what the true 
relationship is. A considerable body of statistical methods has been 
developed to treat this problem, and is presented in the balance of this 
book. 

Summary 

A statement of the change in one variable which accompanies specified 
changes in another is known as a statement of a functional relation. A 
functional relation may be stated either graphically by a curve, or algebrai- 
cally by a definite equation. Although functional relations may be readily 
determined from experimental conclusions where all influences except 
the one being studied are held constant, many problems cannot be studied
by such methods. The statistical methods of regression analysis may be 
used to study functional relations where experimental methods are not 
satisfactory. 



CHAPTER 4 


Determining the way one variable
changes when another changes:
(1) by the use of averages


The problem stated in Chapter 3 was to determine how many feet
automobiles traveling at a given speed require for stopping. It involves
determining the average extent to which one variable changes when another
variable changes. Stated mathematically, the problem is to find the
functional relation between speed and distance, that is, the probable distance
required to stop with any given initial speed. Of the many different
ways of doing this, the simplest, and the one which would suggest itself
most naturally, would be to classify the record into groups, placing all
of one speed in one group, all of another speed in another group, making
as many groups as there are different rates of speed recorded, and then
averaging the different distances for all the cases in each group. This
would then give an average distance for stopping for each given rate of
speed in the series of records. Table 4.1 shows this operation carried out.
Where there were only single observations, this fact has been indicated
by placing the average (the single report) in parentheses.

The averages in the last column of Table 4.1 show quite specifically
how the distance required for stopping tends to increase with speed. But
the increase is not uniform. The cars at 13 miles per hour averaged a
greater distance than those at either 12 or 14, and the cars at 26 a shorter
distance than those at 25.
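The classification by speed and the averaging within each group can be sketched in Python as follows. This is an illustration only: the list `observations` is transcribed from the 63 reports of Table 3.2, and the function name `average_distance_by_speed` is a hypothetical one.

```python
from collections import defaultdict

# The 63 reports of Table 3.2 as (speed in miles per hour, distance in feet).
observations = [
    (5, 2), (21, 39), (28, 84), (10, 8), (26, 39), (27, 57),
    (10, 17), (25, 33), (30, 67), (10, 14), (24, 56), (16, 34),
    (8, 9), (18, 29), (18, 34), (16, 19), (25, 59), (8, 8),
    (17, 29), (27, 78), (5, 8), (12, 11), (25, 48), (5, 4),
    (9, 5), (21, 42), (13, 15), (7, 6), (25, 56), (14, 14),
    (7, 7), (30, 60), (8, 13), (9, 13), (29, 68), (9, 5),
    (4, 4), (17, 22), (14, 16), (5, 8), (16, 14), (8, 11),
    (13, 18), (13, 27), (35, 85), (15, 16), (12, 21), (40, 110),
    (18, 47), (12, 19), (39, 138), (19, 30), (26, 41), (31, 77),
    (20, 48), (28, 64), (35, 107), (21, 55), (29, 54), (22, 35),
    (36, 79), (30, 101), (40, 134),
]


def average_distance_by_speed(pairs):
    """Return {speed: mean stopping distance} with one entry per recorded speed."""
    distances = defaultdict(list)
    for speed, distance in pairs:
        distances[speed].append(distance)
    return {s: sum(d) / len(d) for s, d in sorted(distances.items())}


averages = average_distance_by_speed(observations)
for speed, mean_distance in averages.items():
    print(f"{speed:>2} mph: {mean_distance:6.1f} feet")
```

The irregularities noted in the text appear directly in the output: the average at 13 miles per hour exceeds those at 12 and 14, and the average at 26 falls below that at 25.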

If the successive averages from Table 4.1 are plotted and connected
by lines, both the general increasing tendency and the irregular change
from group to group are easily seen. Figure 4.1 shows this comparison.

Do these differences between the different group averages have any
real significance? Is there any reason to think that this very jagged line
is the true average relation between speed and distance? We can consider


Table 4.1

Computation of Average Distance Required for Stopping after
Signal, for Different Initial Speeds

Speed when Signal    Different Distances Noted    Average Distance
    Is Given              at That Speed*           for That Speed
 Miles per hour               Feet                      Feet
        4                 4                             (4)
        5                 2, 8, 8, 4                     5.5
        7                 6, 7                           6.5
        8                 9, 8, 13, 11                  10.3
        9                 5, 13, 5                       7.7
       10                 8, 17, 14                     13.0
       12                 11, 21, 19                    17.0
       13                 18, 27, 15                    20.0
       14                 14, 16                        15.0
       15                 16                            (16)
       16                 19, 14, 34                    22.3
       17                 29, 22                        25.5
       18                 47, 29, 34                    36.7
       19                 30                            (30)
       20                 48                            (48)
       21                 55, 39, 42                    45.3
       22                 35                            (35)
       24                 56                            (56)
       25                 33, 59, 48, 56                49.0
       26                 39, 41                        40.0
       27                 78, 57                        67.5
       28                 64, 84                        74.0
       29                 68, 54                        61.0
       30                 60, 101, 67                   76.0
       31                 77                            (77)
       35                 85, 107                       96.0
       36                 79                            (79)
       39                 138                          (138)
       40                 110, 134                     122.0


* Data taken from Table 3.2.





that from two points of view, the logic of the relation and the statistical
basis of the differences. Logically the differences are quite nonsensical.
If a given machine can stop in 40 feet when it is going 26 miles an hour,
of course it can stop in at least the same distance when going 25 miles
per hour, and probably something less. It certainly would not take 49 feet,
as the table shows. Then from the statistical point of view the groups
are entirely too small to show definitely how many feet on the average
it takes to stop at any one selected speed. Even the largest groups have
only 4 cases, whereas we have seen in Chapter 2 that 10 to 25 cases may
be required as a minimum to give an average of much reliability.
Computing the standard error for the average from the 25-mile group, it
comes out 5.8 feet. With only 4 cases, however, Figure 2.1 shows that
we have to take a range of 1.15 times the standard error to make the
observed value come within that range of the true value in 2 samples out
of 3.¹ There is thus 1 chance out of 3 that the mean shown by our sample
lies more than 6.7 feet above or below the true mean. The average for
this group of records might therefore be written 49.0 ± 6.7 feet. When
we say that the average distance required for stopping when traveling
25 miles per hour (for all automobiles in town, say) is between 42.3 feet
and 55.7 feet, we are likely to be wrong in 1 out of 3 such statements,
on the average.² With the average from one of the largest groups showing
as little reliability as this, it is quite clear that the zigzag variation from
average to average has no real meaning.

Fig. 4.1. Relation of speed of automobile to distance it takes to stop,
as shown by averages of small groups.

¹ The standard error is computed from the standard deviation of the 4 reports at
25 miles using equation (2.4). This gives a value of 5.8. Figure 2.1 shows that for 4
reports a range of 1.15 times the computed standard error must be taken to secure a
reliability of 0.67, so the confidence limit, with P = 0.67, is ±(5.8)(1.15) or ±6.7.
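The standard-error arithmetic for the 25-mile group may be retraced with the following Python sketch. It assumes that equation (2.4) is the usual standard error of a mean, that is, the sample standard deviation (computed with n - 1) divided by the square root of n; that assumption reproduces the 5.8 quoted in the text. The multiplier 1.15 is taken from the text's reading of Figure 2.1 and is not computed here.

```python
import math
import statistics

# The four stopping distances (feet) recorded at 25 miles per hour in Table 4.1.
distances_at_25_mph = [33, 59, 48, 56]

mean_distance = statistics.mean(distances_at_25_mph)

# Assumed form of equation (2.4): sample standard deviation over sqrt(n).
standard_error = statistics.stdev(distances_at_25_mph) / math.sqrt(
    len(distances_at_25_mph)
)

# Multiplier quoted in the text from Figure 2.1 for 4 reports and P = 0.67.
multiplier = 1.15
half_width = multiplier * standard_error

print(f"mean           = {mean_distance:.1f} feet")
print(f"standard error = {standard_error:.1f} feet")
print(f"range          = {mean_distance - half_width:.1f} to "
      f"{mean_distance + half_width:.1f} feet")
```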

Does that mean that, in spite of the relationship we can see in Figure 4.1,
we can get no reliable statistical measurement of the relation? That
is overstating the case a little; all that we have determined so far is that
the line of averages, the irregular function shown in Figure 4.1, has but
little statistical meaning just as it stands now.

We might be able to make the results more reliable by basing our 
averages on a larger number of reports. But that would be a long and 
expensive process. Isn’t there some way we can find out something more
just from the records we have ? 

Another way of making the conclusions more stable would be by 
combining the records so as to give fewer groups, but with more cases 
in each group. So far we have been working with 29 different groups, 
one for each of the 29 different speeds measured. If instead we group 
them into a few groups — say 6 or 8 — we shall have considerably larger 
groups to work with. 

Independent and Dependent Variables. The question might be asked 
whether the groups should be made on the basis of the rate of speed or 
of the distance required for stopping. (In preparing Table 4.1 we used 
the rate of speed without discussing the matter.) That comes back to the 
question of what we really want to find out. Do we want to know the 
average speed at which machines were traveling when it took them, say, 
20 feet to stop; or do we want to know the average distance machines 
took to stop when they were traveling at a given speed? Obviously, the 
thing we are going to set is the speed limit, and we are merely interested 
in the distances for stopping as one factor to guide us in deciding what 
the speed limit should be. We therefore want to know the effect of speed 
upon average distance, and not the reverse. For that reason we shall 
classify our records on the basis of speed, and then average all the different 
distances for the cars traveling at that speed. 

The same question is met with in nearly all problems where the relation 
between two variables is to be dealt with. It is always necessary to think 
over the problem carefully, and decide which variable we are going to 
regard as the independent or causal variable, and which one as the 
dependent, or resultant. Thus if we were relating variations in tobacco 
yields to applications of fertilizer, obviously the differences in fertilizer 
would be the cause and the differences in yield the result, so we would 
sort our records according to the differences in fertilizer. Other relations 
may not be so clear cut. If the size of stores were being related to profits, 
it might be as logical in some situations to consider that the more successful 




men were able to afford the largest stores as to consider that the larger
stores returned the greater profits. Careful consideration of the facts in
each given case is necessary to clarify exactly what is the particular
relation involved.

As shown later (pages 74-80 and 475-476), it is frequently impossible
to say which variable is the cause and which is the effect. Yet one may
wish to regard one variable as the one whose values are given or known.
This is then called the independent variable and plotted as the abscissa.
The second variable will then be regarded as the one whose values are
to be related to, or estimated from, the values of the known variable.
This is then called the dependent variable, since it is treated as depending
upon the given values of the independent variable. It is sometimes
desirable in particular problems to consider first one variable and then the
other as independent.

Groups of Larger Size. To return to our automobile problem. Since
the speeds varied up to 40 miles per hour, and we have 63 reports to deal
with, we might try breaking them up into 8 groups and see what kind
of averages that will give us. Using groups covering a range of 5 miles
per hour each, we can group the records and determine the averages for
the 8 groups thus formed, getting the results shown in Table 4.2.


Table 4.2

Speed of Automobile and Distance Required for Stopping as
Shown by Group Averages

Speed when Signal    Number of    Average Speed     Average Distance
    Is Given           Cases
 Miles per hour                   Miles per hour         Feet
 Under 7.5                7            5.4                5.6
 7.5-12.4                13            9.6               11.8
 12.5-17.4               11           14.9               20.4
 17.5-22.4                9           19.8               39.9
 22.5-27.4                9           25.6               51.9
 27.5-32.4                8           29.4               71.9
 32.5-37.4                3           35.3               90.3
 37.5 and over            3           39.7              127.3
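The grouping into classes of 5 miles per hour can be sketched in Python as below. The class limits are those of Table 4.2 and the data are the 63 reports of Table 3.2; the names `CLASS_LIMITS` and `group_averages` are hypothetical, chosen only for this illustration.

```python
from bisect import bisect_right

# The 63 reports of Table 3.2 as (speed in miles per hour, distance in feet).
observations = [
    (5, 2), (21, 39), (28, 84), (10, 8), (26, 39), (27, 57),
    (10, 17), (25, 33), (30, 67), (10, 14), (24, 56), (16, 34),
    (8, 9), (18, 29), (18, 34), (16, 19), (25, 59), (8, 8),
    (17, 29), (27, 78), (5, 8), (12, 11), (25, 48), (5, 4),
    (9, 5), (21, 42), (13, 15), (7, 6), (25, 56), (14, 14),
    (7, 7), (30, 60), (8, 13), (9, 13), (29, 68), (9, 5),
    (4, 4), (17, 22), (14, 16), (5, 8), (16, 14), (8, 11),
    (13, 18), (13, 27), (35, 85), (15, 16), (12, 21), (40, 110),
    (18, 47), (12, 19), (39, 138), (19, 30), (26, 41), (31, 77),
    (20, 48), (28, 64), (35, 107), (21, 55), (29, 54), (22, 35),
    (36, 79), (30, 101), (40, 134),
]

# Lower limits of the second through eighth speed classes (miles per hour).
CLASS_LIMITS = [7.5, 12.5, 17.5, 22.5, 27.5, 32.5, 37.5]


def group_averages(pairs, limits=CLASS_LIMITS):
    """Return (number of cases, mean speed, mean distance) for each class.

    Assumes every class contains at least one observation.
    """
    groups = [[] for _ in range(len(limits) + 1)]
    for speed, distance in pairs:
        groups[bisect_right(limits, speed)].append((speed, distance))
    return [
        (
            len(members),
            sum(s for s, _ in members) / len(members),
            sum(d for _, d in members) / len(members),
        )
        for members in groups
    ]


summary = group_averages(observations)
for cases, mean_speed, mean_distance in summary:
    print(f"{cases:>2} cases  {mean_speed:5.1f} mph  {mean_distance:6.1f} feet")
```

Both the average speed and the average distance are kept for each class, since each class average represents several different speeds thrown together.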


These averages can then be plotted and connected by straight lines,
just as were the averages in Figure 4.1. In constructing Figure 4.2,
which shows the result, it is necessary to use the average speed as well as




the average distance-for-stopping in locating each point. This is because 
each of the average distances, as shown in Table 4.2, represents not one 
speed, but several different speeds thrown together. The circles in Figure 
4.2 represent the several group averages plotted this way. The first one 
is located at the intersection of the lines for 5.4 miles per hour and 5.6 
feet, the second at 9.6 miles per hour and 11.8 feet, and so on for the 
remainder. 



Fig. 4.2. Relation of speed of automobile to distance it takes to stop, 
as shown by averages of large groups. 

When the group averages of Figure 4.2 are connected by straight lines
the relation between speed and distance is shown much more satisfactorily
than it was in Figure 4.1. The line in the new figure shows a fairly
continuous relation between speed and distance. But on close examination even
the relation shown in this last figure is not found fully satisfactory. If
we compute the change in distance-for-stopping for each change of 1 mile
in speed, we find that the conclusions are somewhat erratic. Between
the first two averages, the change in speed from 5.4 to 9.6 miles per hour,
an increase of 4.2 miles per hour, is accompanied by a change in distance
from 5.6 to 11.8 feet, or an increase of 6.2 feet. Between 5.4 and 9.6 miles
per hour, therefore, the distance-for-stopping apparently increases 1.5 feet
for each increase of 1 mile per hour in the speed of the machine. Similar
computations for all the other groups are shown in Table 4.3, carrying
out just the same process.

The increase in speed from group to group varies between about 4 and 6
miles per hour, and the increase in distance tends to increase, but
irregularly. This irregularity is more clearly shown in the last column of the






Table 4.3

Computation of Change in Distance for Each Change of 1 Mile
in Speed, for Group Averages

Speed when        Average     Average        Increase    Increase       Increase in Distance
Signal Is Given   Speed       Distance for   in Speed    in Distance    per 1-Mile Increase
                              Stopping                                  in Speed

Miles per hour    Miles       Feet           Miles       Feet           Feet
                  per hour                   per hour

Under 7.5           5.4         5.6
7.5-12.4            9.6        11.8            4.2          6.2            1.5
12.5-17.4          14.9        20.4            5.3          8.6            1.6
17.5-22.4          19.8        39.9            4.9         19.5            4.0
22.5-27.4          25.6        51.9            5.8         12.0            2.1
27.5-32.4          29.4        71.9            3.8         20.0            5.3
32.5-37.4          35.3        90.3            5.9         18.4            3.1
37.5 and over      39.7       127.3            4.4         37.0            8.4


table, in terms of increase in distance for each 1-mile-per-hour increase
in speed. This varies erratically, jumping from 1.6 to 4.0 between the
second and third pairs, dropping back to 2.1, rising again to 5.3, and then
dropping sharply, and rising in the last pair to 8.4.
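
The computation behind the last column of Table 4.3 can be sketched in a few lines of Python. This is only an illustration; the names `group_averages` and `rates_of_change` are our own, and the data are the group averages quoted from Table 4.2.

```python
# Group averages from Table 4.2:
# (average speed in miles per hour, average stopping distance in feet).
group_averages = [
    (5.4, 5.6), (9.6, 11.8), (14.9, 20.4), (19.8, 39.9),
    (25.6, 51.9), (29.4, 71.9), (35.3, 90.3), (39.7, 127.3),
]


def rates_of_change(points):
    """Increase in distance per 1-mile-per-hour increase in speed,
    taken between each pair of successive group averages."""
    rates = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        rates.append((y1 - y0) / (x1 - x0))
    return rates


# Rounded to one decimal, as in the table.
rates = [round(r, 1) for r in rates_of_change(group_averages)]
print(rates)
```

Each rate is simply the slope of one straight segment of the line in Figure 4.2, which is why the zigzag of the chart and the erratic column of the table tell the same story.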

This same variability in the rate of change can be seen directly from
Figure 4.2 by noting the steepness of the several portions of the line.
The irregular and zigzag character of the line shows the same fluctuations
that the computations in Table 4.3 show. Simply by examining this chart
closely it would have been possible to tell about the unsatisfactory
character of the conclusions without taking the time to calculate the
exact rates.

Are the irregularities shown in Table 4.3 and Figure 4.2 of any importance,
or are they due simply to the possibilities of variation in using
so small a sample, just as were the differences in Figure 4.1 and Table 4.1?
Is it really true that an increase in speed has a larger effect upon the distance
required for stopping between 15 and 20 miles per hour than between 10
and 15?

Reliability of Group Averages. The answer involves a consideration
of the statistical basis upon which our conclusions are based. These
last results were calculated from the average speed and average distance
for the several groups of records; obviously they can be no more reliable
than those averages are themselves. In measuring the reliability of those
averages by the methods already discussed, we must estimate the standard
errors which will tell us about how much confidence we can have in each
figure.

The next step, therefore, is to calculate the standard error for each of
the 8 averages of distance. The computation, which is exactly the same
as that used before, based on equation (2.4), is shown in Table 4.4.

Table 4.4

Computation of Confidence Intervals for Average Distances
Required for Stopping

                  Number      Average Distance   Standard      Standard    Confidence      Confidence
Grouped by        of Cases,   Required for       Deviation,    Error of    Interval for*   Intervals for
Speed             n           Stopping, M        s             Mean        P = 0.67        P = 0.67

                  (1)         (2)                (3)           (4)         (5)             (2) ± (5)

Miles per hour                Feet               Feet          Feet        Feet            Feet

Under 7.5           7           5.6               1.56          0.59        ±0.6             5.0-6.2
7.5-12.4           13          11.8               5.08          1.41        ±1.5            10.3-13.3
12.5-17.4          11          20.4               6.82          2.55        ±2.7            17.7-23.1
17.5-22.4           9          39.9               8.81          2.94        ±3.1            36.8-43.0
22.5-27.4           9          51.9              13.43          4.48        ±4.8            47.4-56.4
27.5-32.4           8          71.9              14.96          5.29        ±5.7            66.2-77.6
32.5-37.4           3          90.3              14.89          8.68        ±11.4           78.9-101.7
37.5 and over       3         127.3              15.33          8.95        ±11.8          115.5-139.1

* These values are obtained by multiplying the standard error by the values given
in Table 2.3 and Figure 2.1. Interpolating from Figure 2.1, these are found to be:
for n = 3, 1.31; for n = 7, 1.08; for n = 8, 1.08; for n = 9, 1.07; for n = 11, 1.05;
for n = 13, 1.04.
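
As a check on one row of Table 4.4, the steps can be sketched in Python. We assume here that equation (2.4) has the usual form s divided by the square root of n; that assumption reproduces the standard error printed for the second group. The function names are our own, and the factor 1.04 is the value quoted in the table footnote for n = 13.

```python
import math


def standard_error(s, n):
    """Standard error of a mean from standard deviation s and n cases.
    Assumes the usual form s / sqrt(n) for equation (2.4)."""
    return s / math.sqrt(n)


def confidence_interval(mean, s, n, factor):
    """Interval mean +/- factor * standard error, where `factor` is the
    small-sample multiplier read from Table 2.3 or Figure 2.1."""
    half_width = factor * standard_error(s, n)
    return mean - half_width, mean + half_width


# Second speed group of Table 4.4: n = 13, M = 11.8 feet, s = 5.08 feet.
se = standard_error(5.08, 13)
low, high = confidence_interval(11.8, 5.08, 13, 1.04)
print(round(se, 2), round(low, 1), round(high, 1))
```

For this group the sketch gives a standard error of about 1.41 feet and an interval of roughly 10.3 to 13.3 feet, in agreement with the table.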


Comparing the several averages with their respective confidence intervals,
as shown in the last column of Table 4.4, we find that if we made the
same number of observations over again a number of times and used the
same grouping, there is 1 chance out of 3 that we might find that the true
distance for the second group was less than 10.3 feet or more than 13.3
feet, and so on until for the last group it might be under 115.5 feet or
over 139.1 feet. With this wide possible variation in the sample average
from the true values, it is quite evident that the real facts have not yet
been measured accurately enough to justify detailed computations of the
differences in the slope of different portions of the line. By changing any
one of the averages as much as has been indicated, the slope of the line
would be very materially changed.



Fig. 4.3. Relation of speed of automobile to distance it takes to stop,
as indicated by the confidence intervals.


Range Within Which True Relation May Fall. The extent to which
reliance may be placed in the relationship between the two variables as
shown by the observations which we have to deal with may be judged
from Figure 4.3. Here the actual averages have been plotted, and lines
drawn connecting them, just as before. The confidence intervals (for
probability of 0.67) are then indicated by placing a short bar above and below
each average. Connecting these by dotted lines then indicates the zone
of confidence intervals around the line of averages. Figure 4.3 indicates
that there is a rather wide zone within which the true relation may fall,
especially for the higher speeds, even when we make that zone as narrow
as we have in calculating for the moderate probability of being right
2 times out of 3. But we still do not have any definite measure of the
general relation between speed and distance.

If a smooth continuous line or curve were determined, either by appro- 
priate mathematical processes or by freehand smoothing, and were drawn 
through the successive group averages, that would give a far better 
measure of the relationship. (It is left for the student to try drawing a 
smooth continuous freehand curve through the line of group averages, 
and to see how many times it falls outside the confidence zone. For this 
exercise, Figure 4.3 should be replotted on a larger scale, using the data 
of Table 4.2, and the last column of Table 4.4.) 

We have seen how increasing the number of cases included in a single 
group increased the dependence which would be placed in that group. 
However, even by reducing our 63 cases to 8 groups we have not been 
able to get a consistent and satisfactory statement of the relation. Is it 
possible that by handling all the data as a single group we could get a 
better result? One way of doing this would be to average all the speeds 
and all the distances together. But that would only tell us what was the 
average distance required for stopping and the average speed. What we 
want to know is what distance is most likely to be required at any given 
speed, and the treatment just suggested would not give us that. 

There is another way, though, of determining the relation while con- 
sidering all the records together. If we are willing to assume that an 
increase of one mile per hour in the rate of speed will increase the distance 
required for stopping by exactly the same number of feet, no matter how 
rapidly or how slowly the machine is already moving, then we can deter- 
mine this relation for all the data as a whole. On this basis a straight 
line can be used to represent the relation. All that we have to do is to
determine a straight line which will come as near as possible to representing
the relation as shown by all 63 individual observations. (A straight
line might also be drawn by eye, as an exercise, and the results compared 
with those of the preceding exercise.) 

Summary 

The change in one variable with changes in another may be approxi- 
mately determined by grouping the records according to the independent 
variable and determining the corresponding averages for the dependent 
variable. Unless a very large number of observations is available, however, 
the functional relation shown by the successive averages will be irregular 
and inconsistent, owing solely to sampling variability. For that reason 
some method is needed for measuring the functional relation for the 
group of records as a whole. The simplest way in which this can be done
is by assuming that the relation can be represented by a continuous
straight line. Methods of determining such a line will be considered in
the next chapter.


Note 4.1. It is always possible to reverse the dependent and the independent
variables. Thus the data presented in Figure 3.2 might have been plotted with time as
the independent variable and with distance fallen as the dependent. A curve might
then have been drawn in to show the average distance which a body can traverse for a
given time of fall. Similarly the data charted in Figure 3.3 might have been charted
with distance as the abscissa and speed as the ordinate. These two alternative ways of
stating the relation generally yield somewhat different functions if errors of measurement
or "outside" sources of variation are at all large.



SECTION II


Simple Regression,
Linear and Curvilinear


CHAPTER 5

Determining the way one variable
changes when another changes:
(2) according to the straight-line
function


There are several ways by which a straight line can be determined 
to show the functional relation between two variables. One way would 
be simply to place a ruler over the chart along the several group averages, 
or to stretch a black thread over them, and draw the line in by eye so as 
to fall as nearly as possible along them. Although no two persons would 
draw their lines exactly the same, still this method might give fairly 
satisfactory results where only a rough measure was wanted. However, 
it would often be advantageous to determine a particular straight line 
that could be exactly duplicated by other persons and that would qualify 
in some sense as the best possible straight line for expressing the relation- 
ship in this particular set of data. We shall therefore use the exact regres- 
sion method of determining the straight line. But first we must consider 
the meaning of a straight line. 

The Equation of a Straight Line 

The determination of what this line will be consists in finding the
constants for the equation of the line. Just as we have already seen
(Chapter 3) that the curve showing the relation between the distance a
body has to fall and the time it takes can be expressed by an equation,
so any straight line can be expressed by the relation¹

\[ Y = a + bX \tag{5.1} \]

Figure 5.1 illustrates the meaning of a and b in this formula. When
the value of X is zero, b times X is zero and Y is equal to a. This constant,
a, therefore, gives the height of the line (in terms of Y or vertical units)
at the point where X is zero. This is indicated at the left edge of the chart.



Fig. 5.1. Graph of the function Y = a + bX. (Horizontal axis: values of X.)

From the same equation, every time X increases one unit, Y increases
b times one unit, since Y is computed as a plus b times X. The difference
of the height of the line (measured in Y units), between the point where
X is 1 and where X is 2, is therefore b units of Y, just as indicated on the
chart. And this continues to hold true for every unit change in X, whether
from 1 to 2, or from 0 to 1, or from 99 to 100.

The meaning of these constants in the equation of the straight line,
as equation (5.1) is known, may be illustrated more concretely by taking
some actual values for the constants a and b, and seeing how the line
would look then. If we take 3 for a, and 2 for b, the equation would
then read

Y = 3 + 2X

Figure 5.2 shows the line for which this is the equation. Thus if X
is taken as zero, the value of Y is found to be

Y = 3 + (2 × 0) = 3 + 0 = 3

And 3 is therefore the Y value corresponding to the X value zero.

¹ Written this way, the equation is a perfectly general one which can be applied to the
relation between any two variables, calling one of them Y and the other one X.
The symbol Y in the equation simply represents the number of units of the variable we
designate as Y, whatever that may be, acres, dollars, pounds; and the symbol X
likewise represents the number of units of the variable we designate as X.





Simple Linear Regression 

Similarly if X is taken as 10,

Y = 3 + (2 × 10) = 3 + 20 = 23

And the Y value corresponding to the X value of 10 is therefore 23. All
other values of Y which may be computed for values of X within the range
shown in Figure 5.2 will similarly be found to lie exactly on the same line.

Fig. 5.2. Graph of the function Y = 3 + 2X.

Figure 5.2 illustrates again the meaning of the constants a and b.
When X is zero, the value of Y is three units above zero, as indicated,
and for every unit increase in X (say from 5 to 6) the value of Y goes up
2 units. This is exactly the same thing as shown in Figure 5.1, except
that there no definite values were assigned to a and b, whereas here they
have been given exact numerical values.
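
The roles of the two constants can also be seen by evaluating the line directly. The short Python sketch below is our own illustration (the helper name `straight_line` is hypothetical); it uses the values a = 3 and b = 2 from the example.

```python
def straight_line(a, b):
    """Return a function computing Y = a + b*X for fixed constants a and b."""
    def y_of(x):
        return a + b * x
    return y_of


line = straight_line(3, 2)

print(line(0), line(10))       # height of the line at X = 0 and at X = 10
print(line(6) - line(5))       # rise in Y for a one-unit increase in X
print(line(100) - line(99))    # the same rise, wherever the unit step is taken
```

The first value printed is the constant a, and each of the differences is the constant b, whichever unit step in X is chosen.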

It is quite apparent from Figures 4.2 and 4.3 that the relation between 
auto speeds and distance required for stopping could not be well 
represented by a straight line. We will therefore use another example 
to illustrate this technique. This case is taken from hydrology, where 
it is important in planning hydroelectric, flood-control, or irrigation
projects, to know the past history of stream flows. In one case a dam was 
being planned on the Kootenai River at Newgate, B.C., near the point 
where it crossed the Canadian border, but stream records there began 
only in 1931. Records were available for a longer period at Libby, 
Montana, further down the stream. How could the flow at Newgate 
be estimated from that at Libby? Since we wish to estimate the flow at 
Newgate from that at Libby, we must regard the volume at Newgate as 
the dependent variable, and that at Libby as the independent variable. 





Table 5.1

Water Flow at Two Points on Kootenai River, in January,*
and Calculations for Fitting a Straight Line

          Newgate,       Libby,
Year      B.C.,          Mont.,         X²                  XY
          Y              X

               Units of 100 cfs

1925                     42.0
26                       24.0
27                       38.0
28                       49.4
29                       24.6
1930                     24.2
31        19.7           27.1              734.41              533.87
32        18.0           20.9              436.81              376.20
33        26.1           33.4            1,115.56              871.74
34        44.9           77.6            6,021.76            3,484.24
1935      26.1           37.0            1,369.00              965.70
36        19.9           21.6              466.56              429.84
37        15.7           17.6              309.76              276.32
38        27.6           35.1            1,232.01              968.76
39        24.9           32.6            1,062.76              811.74
1940      23.4           26.0              676.00              608.40
41        23.1           27.6              761.76              637.56
42        31.3           38.7            1,497.69            1,211.31
43        23.8           27.8              772.84              661.64

Totals    ΣY = 324.5     ΣX = 423.0     ΣX² = 16,456.92     ΣXY = 11,837.32
Means     M_y = 24.96    M_x = 32.54

* Source: Extending Stream-Flow Records, U.S. Department of the Interior,
Geological Survey, Water Resources Branch, pp. 7, 8, September, 1947.


The relevant data are shown in the first two columns of Table 5.1,
together with data for six earlier years at Libby. (These Libby
observations are actually available still further back.) When data for 1931 to
1943 are plotted on a dot chart, as shown in Figure 5.3, a marked relation
between the two variables is evident, one that can apparently be represented
by a straight line. The next step is to determine the straight line that best
describes the relation. We will therefore use X to stand for flow at Libby,
and Y for flow at Newgate, each in hundreds of cubic feet per second (cfs).

Thus when we write the equation Y = a + bX, we shall be using that as
shorthand for

Flow at Newgate (in 100 cfs) = a + b (flow at Libby, in 100 cfs)



Fig. 5.3. Dot chart of observations of water flow at Libby and
Newgate, and straight line fitted to them. (Horizontal axis: flow at
Libby, X, in units of 100 cfs.)

To give this equation definite meaning we must determine the numerical 
values for a and b, just as in our preceding illustration we had to assume 
numerical values for these constants before the graph had any definite 
meaning for us. 

The “Observation Equations.” One way of finding the values is by
regarding each one of our original observations (Table 5.1) as an algebraic
equation itself. Thus the 1931 observations, 27.1 at Libby and 19.7
at Newgate, would be written

19.7 = a + b(27.1)

putting the 19.7 in the place of Y in the equation, and the 27.1 in place
of X.



Similarly the next observation, 18.0 at Newgate and 20.9 at Libby,
would be

18.0 = a + b(20.9)

and so on right through the last observation.
Bringing all these different equations together would give a series
looking like this:

19.7 = a + 27.1b
18.0 = a + 20.9b
26.1 = a + 33.4b
. . .
31.3 = a + 38.7b
23.8 = a + 27.8b

(The middle equations are omitted here to save space.)

Since we have 13 observations of both variables, we have in all 13
observation equations, each containing the two unknown constants a
and b.

Now by the rules of simple algebra, any two independent equations
containing the same two unknown constants can be solved simultaneously
to obtain the numerical values for those constants. One way to find the
values of our a and b would be to pick two of the equations representing
our observations and solve them simultaneously. Suppose we take the
first and the last ones; we shall then have

a + 27.1b = 19.7 and a + 27.8b = 23.8

Solving these two equations simultaneously, we find the values
a = -139.01 and b = 5.86. But in getting these values we have used only
2 out of the 13 observations, and also 2 whose values of X were very close
together. Would we get the same result if we used another pair? Let
us try the second observation and the next to the last. Then we will have

a + 20.9b = 18.0 and a + 38.7b = 31.3

These two equations, solved simultaneously, give the values a = 2.39,
and b = 0.75, which are certainly far different from the first set.
Apparently the values obtained by this method would depend upon the
particular pair of observations selected, perhaps varying with each pair.

If we work out estimated values of Y for selected values of X according
to these two solutions, we get results as follows. According to the first
set,

Y = -139.01 + 5.86X, so when X = 20, Y = -21.8;
and when X = 30, Y = 36.8

But according to the second set,

Y = 2.39 + 0.75X, so when X = 20, Y = 17.4;
and when X = 30, Y = 24.9

If we should then plot the two calculated points for the first of these 
equations, and connect them by a straight line, we should find that that 
line also passes through the two dots which represent the two observations 
from which the values were calculated. The same would hold true for 
the second line and the second pair of observations. We could compute 
as many different lines as there are different possible pairs of observations 
not lying on the same line. Fitting a straight line to two points, as we have 
done here, is simply equivalent to drawing a line to pass through those 
two points. This can be seen by plotting the two lines just calculated on 
Figure 5.3. (This is left as an exercise for the student.) Quite clearly no 
single line could pass through all the different points. If we computed 
more lines by this process of using selected pairs of points, we should 
just get a larger variety of different lines. And the closer together the 
points were, with respect to the values of X, the greater the differences 
in the slopes of the lines would tend to be. 
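
The two-point solutions discussed above can be reproduced with a short Python sketch. It is only an illustration; the helper name `line_through` is our own, and each point is written as (flow at Libby, flow at Newgate).

```python
def line_through(p, q):
    """Constants (a, b) of the straight line Y = a + b*X passing exactly
    through the two points p and q, each given as (X, Y)."""
    (x1, y1), (x2, y2) = p, q
    b = (y2 - y1) / (x2 - x1)   # change in Y per unit change in X
    a = y1 - b * x1             # height of the line at X = 0
    return a, b


# First and last observations (1931 and 1943).
a1, b1 = line_through((27.1, 19.7), (27.8, 23.8))

# Second and next-to-last observations (1932 and 1942).
a2, b2 = line_through((20.9, 18.0), (38.7, 31.3))

print(round(a1, 1), round(b1, 2))
print(round(a2, 2), round(b2, 2))
```

Apart from rounding in the last figure, the constants agree with those found in the text. Because the first pair of X values differ by only 0.7, a small difference in Y is magnified into a very steep slope, which is the instability the paragraph describes.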

Fitting the Line by “Least Squares.” If we are going to use a
mathematically determined straight line at all, what we need is one which
represents all 13 observations, instead of any particular pair of them.
No one line can exactly fit all 13 observations, for, as we have just seen,
the line which would exactly agree with the first and last would not agree
at all with the second and next to the last. What we shall have to find is
some compromise line which will come as near as possible to agreeing
with all the 13 observation equations, even though it does not exactly
agree with any one. Mathematicians have derived a method of obtaining
such a line by the use of what is known as the “method of least squares.”
This method takes all the observations into account, giving each of them
an equal weight in determining the result: a line such that the sum of
the squares of the departures of Y from the line will be as small as possible.
It also has certain other mathematical properties which make it of great
value in handling problems of this sort.

The equations upon which the process is based are derived by the 
use of calculus. The method itself, however, is simple and can be used 
by anyone with a knowledge of simple algebra. 

Computing the Extensions. The individual observations are listed
as already shown in Table 5.1. Each X item is squared and entered in
the column headed X², and each X item is multiplied by the accompanying
Y item, and the product is entered in the column headed XY. All the
items in each column are summed (excluding those for the years before
1931), giving the totals at the foot of each. The symbol ΣX represents
the sum of all the X items, ΣY represents the sum of all the Y items,
Σ(X²) represents the sum of all the X² items, and Σ(XY) represents the
sum of all the products of XY. The means of X and Y are also calculated,
and entered in the final line.

Solving the Equations. We next proceed to find the values of a and b
by using the following formulas:

\[ b = \frac{\Sigma(XY) - nM_xM_y}{\Sigma(X^2) - n(M_x)^2} \tag{5.2} \]

\[ a = M_y - bM_x \tag{5.3} \]

In using these formulas the value of b is determined first; then it is
used in the next formula to determine the value of a.²

Using the values for ΣX, ΣY, Σ(X²), and Σ(XY) given in Table 5.1,
in equations (5.2) and (5.3), we find the values of b and a to be

\[ b = \frac{\Sigma(XY) - nM_xM_y}{\Sigma(X^2) - n(M_x)^2} = \frac{11{,}837.32 - 13(32.54)(24.96)}{16{,}456.92 - 13(32.54)(32.54)} = \frac{1{,}278.74}{2{,}691.85} = 0.475 \]

\[ a = M_y - bM_x = 24.96 - (0.475)(32.54) = 9.504 \]


² If both X and Y had been stated in terms of deviation from their mean values (just
as was done when the standard deviation s was computed in Table 1.6) they would have
been denoted by the symbols lower-case x and lower-case y. The product in the fourth
column of Table 5.1 would then have been designated xy, and its sum, Σ(xy). The
correction factors used in equation (5.2) are simply to change the sums of the original
extensions, Σ(X²) and Σ(XY), to what they would have been if computed instead
from the deviations from the mean. That is to say,

\[ \Sigma(XY) - nM_xM_y = \Sigma(xy) \quad \text{and} \quad \Sigma(X^2) - n(M_x)^2 = \Sigma(x^2) \tag{5.4} \]

Equations (5.2) and (5.3) are only another way of stating the “normal equations,”
which can be solved simultaneously to give the values for a and b. These equations are

\[ na + (\Sigma X)b = \Sigma Y \]
\[ (\Sigma X)a + (\Sigma X^2)b = \Sigma XY \]

When these two equations are solved simultaneously, the results are

\[ b = \frac{\Sigma(xy)}{\Sigma(x^2)} \quad \text{and} \quad a = M_y - bM_x \tag{5.5} \]

The method by which this line is fitted rests upon the assumption that the scatter of
the individual observations around the fitted line will approximate a normal distribution.
If one or two observations are exceedingly erratic as compared to the others, so that
the scatter of the observations around the line will be very skew, this method of fitting
may be unsatisfactory.




The equation for the straight line, as thus determined by all the
observations, is therefore

Y = 9.504 + 0.475X

This line is called the line of best fit, since it is the line which gives,
for all the observed values of X, values of Y which come as near as
possible to agreeing with all the different Y values observed.³
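
The whole computation of equations (5.2) and (5.3) can be sketched in Python from the 1931 to 1943 observations of Table 5.1. This is an illustrative sketch, not a prescribed procedure; the function name `fit_line` is our own. Because the sketch carries the means at full precision, whereas the hand computation rounds them to two decimals, the constants differ very slightly in the last figure.

```python
# January flows for 1931-1943, in units of 100 cfs (Table 5.1).
libby = [27.1, 20.9, 33.4, 77.6, 37.0, 21.6, 17.6,
         35.1, 32.6, 26.0, 27.6, 38.7, 27.8]      # X
newgate = [19.7, 18.0, 26.1, 44.9, 26.1, 19.9, 15.7,
           27.6, 24.9, 23.4, 23.1, 31.3, 23.8]    # Y


def fit_line(xs, ys):
    """Least-squares constants (a, b) of Y = a + b*X."""
    n = len(xs)
    sum_x2 = sum(x * x for x in xs)                  # sum of the X-squared column
    sum_xy = sum(x * y for x, y in zip(xs, ys))      # sum of the XY column
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum_xy - n * mean_x * mean_y) / (sum_x2 - n * mean_x ** 2)  # equation (5.2)
    a = mean_y - b * mean_x                                          # equation (5.3)
    return a, b


a, b = fit_line(libby, newgate)
print(round(a, 2), round(b, 3))
```

The slope comes out at about 0.475 and the constant at about 9.5, as in the text.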

Estimating Y from X. We can now take any given flow at Libby
we wish and estimate from the equation what would be the most probable
flow at Newgate on the basis of the straight-line relationship.

If 2,090 cfs (the flow in 1932) is taken, X will be 20.9. Substituting
this value in the equation gives an estimated value of Y, designated Y'.

Y' = 9.504 + 0.475(20.9) = 19.43

So the expected flow at Newgate, for 20.9 at Libby, would be 19.4.
In 1932, the actual flow was 18.0, slightly lower than estimated. Similarly,
if we take the unusually heavy flow in 1934, of 77.6 at Libby, and substitute
in the equation, we get the result

Y' = 9.504 + 0.475(77.6) = 46.36

This estimate compares with an actual flow of 44.9 that year, again
somewhat lower than estimated. If we similarly estimate Y for each year
from the equation, using the corresponding values of X, we get results
shown in Table 5.2.

Subtracting each Y' from the actual Y gives the differences which are
designated z. The z's have a mean of practically zero (the slight difference
from zero is due to rounding off in the calculations). We can now
plot the line of relationship on Figure 5.3, by using any two Y' with the
corresponding values of X. This line, shown as a solid line, indicates
the estimated value of Y, Y', for any value of X within the range shown.
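
The estimates and residuals can be generated mechanically from the fitted equation. The following Python sketch is our own illustration; it takes the constants 9.504 and 0.475 as given, and rounds to one decimal as the printed table does.

```python
A, B = 9.504, 0.475   # constants of the fitted line Y' = A + B*X

# January flows for 1931-1943, in units of 100 cfs.
libby = [27.1, 20.9, 33.4, 77.6, 37.0, 21.6, 17.6,
         35.1, 32.6, 26.0, 27.6, 38.7, 27.8]      # X
newgate = [19.7, 18.0, 26.1, 44.9, 26.1, 19.9, 15.7,
           27.6, 24.9, 23.4, 23.1, 31.3, 23.8]    # Y

# Estimated flow at Newgate for each year.
estimates = [round(A + B * x, 1) for x in libby]

# Residual z = Y - Y' for each year.
residuals = [round(y - y_est, 1) for y, y_est in zip(newgate, estimates)]

mean_residual = sum(residuals) / len(residuals)
print(estimates)
print(residuals)
print(round(mean_residual, 3))
```

The mean of the residuals is within a few hundredths of zero, the small remainder being due to rounding.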

Interpreting the Linear Equation. Just what does the line of least
squares tell us? What is the meaning of the fitted line Y = 9.504 + 0.475X?
What do the constants of the equation (the statistics 9.504 and 0.475,
which we have calculated from the values given in our sample of 13
cases) really tell?

The first of these statistics, the value for a, is the height of the line when
X = 0. It indicates that a flow of 9.5 units might be expected at Newgate
³ If the differences between each of the actual observations and the estimated values
given by this equation are computed, squared, and summed, that sum will be smaller
than it would be if any other straight line were used. Since this method determines the
line with the smallest possible squared deviations, the line is known as the least-squares
line, and the method of computing it is known as the method of least squares.





Table 5.2

Water Flow at Two Points and Flow Estimated by Linear
Equation

          Libby,       Newgate,      Estimated by
Year      Mont.,       B.C.,         Equation,        Residual,
          X            Y             Y'               Y - Y'

                    Units of 100 cfs

1925      42.0                       29.5
26        24.0                       20.9
27        38.0                       27.6
28        49.4                       33.0
29        24.6                       21.2
1930      24.2                       21.0
31        27.1         19.7          22.4             -2.7
32        20.9         18.0          19.4             -1.4
33        33.4         26.1          25.4              0.7
34        77.6         44.9          46.4             -1.5
1935      37.0         26.1          27.1             -1.0
36        21.6         19.9          19.8              0.1
37        17.6         15.7          17.9             -2.2
38        35.1         27.6          26.2              1.4
39        32.6         24.9          25.0             -0.1
1940      26.0         23.4          21.9              1.5
41        27.6         23.1          22.6              0.5
42        38.7         31.3          27.9              3.4
43        27.8         23.8          22.7              1.1


even when there is no flow at all at Libby. Since Libby is downstream
from Newgate, this seems to be an absurd result. The statistic a therefore
has no meaning of and by itself in this particular example, beyond placing
the height of the line as a whole for the range within which it does have
meaning.

The statistic for b, on the other hand, is always meaningful. It indicates
the difference in Y for every difference of 1 unit in X, on the average of all
the observations, and within the range covered. In this example the value
of 0.475 indicates that between 17.6 and 77.6, each increase of 1 unit in X,
that is, each increase of 100 cfs at Libby, is associated with an increase of
0.475 units, or 47.5 cfs, in the flow at Newgate. This kind of interpretation
of the value of b can always be made, and is one of the most important
results obtained by determining the statistics for the straight line.

Range Within Which the Estimates Are Meaningful. If we estimate
the flow Y when X = 200, it comes out about 104.5. But 200 for X is almost
three times as large as the highest value reported during the entire period
from 1931 to 1943, on which we based our statistics. Just as we have seen
that the estimate of Y' for X = 0, far below the smallest of the actual
observations, gives an irrational result, we must use caution in estimating
what the relation would be for a value of X far beyond the range of actual
observations. We know that within the observed range of X, from roughly
18 to 78, the straight line represents the relation fairly well. Beyond that
range, we cannot be sure it will still apply. Chapters 17 and 18 give more
exact bases for estimating the confidence with which we can use particular
estimates.

Only within the range covered by the original observations of X can an
estimating equation of this type be used with confidence. The only exception
would be in a case where there is some logical reason, based on other
knowledge, to believe that the linear relation would hold true beyond the
observed range.
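
One way to keep this caution in view when the equation is used routinely is to have the estimating step report whether X lies inside the observed range. The Python sketch below is only an illustration; the function name `estimate_newgate` and its defaults are our own, with the range 17.6 to 77.6 taken from the 1931 to 1943 Libby observations.

```python
def estimate_newgate(x, a=9.504, b=0.475, observed_range=(17.6, 77.6)):
    """Return (estimate, within_range) for a Libby flow x in units of 100 cfs.

    within_range is False when x lies outside the range of X on which
    the line was fitted, so that the estimate is an extrapolation.
    """
    low, high = observed_range
    return a + b * x, low <= x <= high


for x in (49.4, 200, 0):
    estimate, within_range = estimate_newgate(x)
    note = "within observed range" if within_range else "extrapolation, use with caution"
    print(x, round(estimate, 1), note)
```

The arithmetic is the same in every case; only the flag distinguishes an estimate supported by the observations from one that rests on the untested assumption that the line continues unchanged.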

CONFIDENCE INTERVALS FOR THE STATISTICS. The statistics for a and b
calculated from a sample of observations are not necessarily the true
parameters for the entire universe. Instead, the statistics will vary from
one sample to another. If, for example, it had been possible to use records
of water flow for one hundred years at the two points (and if the conditions
in the universe had shown no change during that period) it would have
been possible to compute statistics for a and b which would have been
far more reliable. Just as standard errors and confidence intervals can
be estimated for means, so they can be estimated for a and b, by methods
given in Chapter 17.

CONFTOENCE INTERVALS FOR THE ESTIMATED VALUES. Similarly, the 
estimates of Y{Y'), made from the equation, are subject to possible error. 
The standard error of estimate based on the standard deviation of the 
residuals, s., serves as one indication. The standard error of estimate 
for the straight line is given by the equation 




V -,2 



(5.6) 


In this case, — 1.79. From Figure 2.1 we see that with 13^ cases 

* Since there are two constants involved in the estimate in this case, a and b. ve must 
deduct one obsersation, i.e.. use 12 instead of 13, in reading the values from Table 2.s 
or Figure 2.1. 



55 SJmp/e degression, Dneor ond Curviimeor 

we must multiply the calculated standard error by about 1.05 for P = 0.67,
and by 2.2 for P = 0.95. Accordingly, in this case we would expect
two-thirds of the new estimates made by the calculated equation to fall
within a confidence interval of Y' ± 1.05 $\bar{S}_{y.x}$, or ±1.88 of the
true values, and nineteen out of twenty within an interval of
±2.2 $\bar{S}_{y.x}$, or ±3.93.
In using these confidence intervals, we are assuming that a given value
of X might have been associated with any of a number of values of Y
and that these would be distributed about an average value of Y which,
in the universe of all possible Y's, would be associated with the given
value of X. Thus, we might write the 95 per cent confidence interval
for Y, given the 1928 value of X, as 33.0 ± 3.93, or from 29.1 to 36.9.† If
we could select a large number of years in each of which X stood at the
1928 level, we should expect to be right 95 times out of 100 if we assumed
that the range Y' ± 3.93 included the average universe value of Y which
was associated with an X value of 49.4. This assumes that there was no
change in the universe during the period. The residuals in Table 5.2
suggest a slight upward trend with the passage of time, so this assumption
may not be completely accurate in this case.
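
The arithmetic behind these intervals can be checked in a few lines. The
sketch below is only an illustration: the variable and function names are
invented here, and the multipliers 1.05 and 2.2 are simply the values the
text reads from Figure 2.1. Note that the product of the rounded figures
1.79 and 2.2 is 3.938, which the text carries as 3.93.

```python
# Figures quoted in the text; the variable names are illustrative only.
std_error = 1.79        # standard error of estimate
multiplier_67 = 1.05    # read from Figure 2.1 for P = 0.67
multiplier_95 = 2.2     # read from Figure 2.1 for P = 0.95

half_width_67 = std_error * multiplier_67   # about 1.88
half_width_95 = std_error * multiplier_95   # about 3.94 from these rounded inputs


def interval(estimate, half_width):
    """Return the (lower, upper) limits of estimate +/- half_width."""
    return estimate - half_width, estimate + half_width


# 95 per cent interval for the 1928 estimate, using the text's half-width of 3.93.
lower, upper = interval(33.0, 3.93)
print(f"{lower:.1f} to {upper:.1f}")
```

The same `interval` helper can be reused for any other estimate Y' once
the half-width has been worked out.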

Fitting a Line by Semi-Averages. There is a simpler method of fitting
a straight line, which is useful for preliminary studies, and where there are
no extreme cases or single wide departures from the apparent line,
especially at the extremes (as with the observation for 1934 in the present
example).

In using this method, we divide all the observations into two approximately
equal halves according to the values of the independent variable
X, and take the averages of X and Y for each group. From Table 5.1
we see that the first group should be from 17 to 28 (7 cases), and the
second 32 and above (6 cases). Calculating the averages for each group,
we obtain values as follows:

Lower group, $M_x$ = 24.09    $M_y$ = 20.51
Upper group, $M_x$ = 42.4     $M_y$ = 30.15

Setting up these values in equation form as before,

20.51 = a + 24.09b
30.15 = a + 42.4b

and solving the equations simultaneously, we get the values a = 9.14,
and b = 0.47. The corresponding line, Y = 9.14 + 0.47X, can also be
plotted in Figure 5.3. Since the values of a and b here are very close to

† More exact procedures for estimating the reliability of a single estimate are given on
pp. 319 and 320 of Chapter 19.




those determined for the line by least squares, the two lines will lie very
close to one another.
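
The procedure just described can be sketched in code. This is a minimal
illustration, not the authors' computation: the function name
`fit_semi_averages` is invented here, the observations are hypothetical
(they are not the data of Table 5.1), and the rule of giving the lower
group the extra case when the number of observations is odd is an
assumption that happens to match the 7-and-6 split used in the text.

```python
from statistics import mean


def fit_semi_averages(xs, ys):
    """Fit Y = a + bX through the mean points of the lower and upper halves."""
    pairs = sorted(zip(xs, ys))      # order the observations by X
    half = (len(pairs) + 1) // 2     # lower group takes the extra case if n is odd
    lower, upper = pairs[:half], pairs[half:]
    mx_lo = mean(p[0] for p in lower)
    my_lo = mean(p[1] for p in lower)
    mx_hi = mean(p[0] for p in upper)
    my_hi = mean(p[1] for p in upper)
    b = (my_hi - my_lo) / (mx_hi - mx_lo)   # slope between the two mean points
    a = my_lo - b * mx_lo                   # intercept from the lower mean point
    return a, b


# Hypothetical observations, invented for illustration.
xs = [1, 2, 3, 4, 5, 6]
ys = [3, 5, 4, 8, 9, 10]
a, b = fit_semi_averages(xs, ys)
print(f"Y = {a:.2f} + {b:.2f} X")
```

Because the line is fixed by only two averaged points, moving a single
observation from one group to the other changes both constants, which is
the sensitivity to group composition that the method is known for.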

The line of semi-averages has no exact basis for its fit, and will vary
somewhat with the exact composition of the groups selected, and does
not necessarily make the average of Y' coincide with the average of Y. We
can see the effect of this lower accuracy when we compute the estimated
Y' for 1942 by the semi-average line, with X = 38.7, Y' = 21.2, and
the residual z = 4.0. This year showed the largest error by the least-squares
line, but that was only 3.3, which is substantially smaller.

Methods of estimating standard errors or confidence intervals for the
line fitted by semi-averages have not been developed, and it is therefore
not used where reliable results are desired.

Usefulness of the Straight Line. The straight line is a type of relation
of very great importance and usefulness. It is one of the simplest functions
to fit and to explain, and for that reason it is very widely used. Equations
(5.4) and (5.5), which are used in determining the constants of the equation,
are therefore of great importance. The student of analytical statistics
should become thoroughly familiar with the methods of determining the
constants of the equation and should understand thoroughly both the
meaning and the limitations of this type of analysis.

Determining the constants for an equation for a given set of observations
is called “fitting” the equation to the data. Because the linear equation
is one of the simplest of all equations to “fit,” it is widely and frequently
used. In many cases no other possible relation is even considered. Actually,
however, the linear equation is very limited in its logical meaning. By
its very nature, it can represent only a situation where the change in the
dependent variable for a unit change in the independent variable would
be expected to be just the same regardless of how large or how small
the independent variable was; i.e., where the regression line has the same
slope throughout. This is a very precise and narrow relation. In many
cases, the line which theoretically would be expected would have a changing
slope as the value of the independent variable changed. Unless there is
a good logical reason to expect the linear equation to represent truly the
situation present, fitting a straight line can be regarded only as an empirical
exercise, with no meaning to the constants obtained beyond the purely
formal one of specifying the straight line that most nearly represents the
observed data.

Summary 

To express a functional relationship by a straight line, the constants
may be determined arithmetically by the methods of semi-averages or
of least squares. The least-squares line gives the line of best fit under the
assumptions of that method: a normal distribution of the observations
around the line and the reduction of the squared residuals to a minimum.
Estimates of the dependent variable may be made according to the linear
function for any value of the independent variable. Only within the range
which includes the bulk of the independent values does this estimate have
meaning, however, and only then if the straight line gives a satisfactory
expression of the observed relation, either empirically or logically.


Note 5.1. Just as a straight line can be fitted to show the average flow at Newgate
for each given flow at Libby, so another straight line can be fitted if the variables are
reversed. In that case the flow at Libby would be regarded as the dependent or Y
variable and that at Newgate would be regarded as the independent or X variable.



CHAPTER 6 


Determining the way one variable 
changes when another changes: 
(3) for curvilinear functions 


A straight-line equation is frequently a fairly good empirical statement 
of the relation between two variables even when the true relation is more 
complex than the straight line can portray. Yet it may be just as important 
to know the exact or approximate “form” of the relationship as it is to 
have an empirical statement of it. For that reason it is necessary to con- 
sider other ways of expressing a relationship than the straight line. 

The automobile-stopping case (Figures 4.1 and 4.2) showed that the 
relations could not be well expressed by a straight line, especially below 
15 miles per hour, and above 35. The latter might be very important for 
the purpose of the whole study. 

The real difficulty involved would have been the assumption that the 
straight-line function applied. That would have assumed that an increase 
of one mile in the speed of the car increased the distance required for 
stopping by the same number of feet, no matter how fast the car was 
already traveling. When we examine Figures 4.1 and 4.2 closely, we see
that this is not correct; the line of averages slants up slowly at first, then 
tends to rise more steeply as the speed is increased, until it has the steepest 
slope at the highest speed. It is therefore incorrect to assume that we can 
express the slope of the line by determining the average increase in stopping 
distance for an increase of one mile in the rate of speed; for the increase 
in stopping distance is not the same regardless of the rate of speed, but 
tends to become greater as the rate of speed increases. Only if our way 
of stating the line can express that fact too will it sum up all our observa- 
tions with sufficient accuracy. 

What is needed is some general way of stating the relation between
speed and distance similar to the general relation expressed in the
straight-line formula, yet expressing a changing slope instead of the
uniform slope shown by the straight line.


Different Types of Equations 

In the same way that it is possible to represent relations mathematically
by a straight line, it is possible to represent them by curves of various types.
We have seen how the equation Y = a + bX can be used to represent
any straight line by determining the proper values to be assigned to the
constants a and b. There is practically no limit to the different kinds
of curves which can be similarly described by mathematical equations.
The equations of a number of curves which are useful in statistical analysis
of the relations between variables are:


$Y = a + bX + cX^2$              (a)

$\log Y = a + bX$                (b)

$\log Y = a + b \log X$          (c)

$Y = a + b \log X$               (d)

$Y = \dfrac{1}{a + bX}$          (e)

$Y = a + bX + cX^2 + dX^3$       (f)

$Y = a + bX + c\left(\dfrac{1}{X}\right)$    (g)


Each of these equations can be used to represent a certain type of
curve. Thus type (a) is the equation of a parabola. If we take certain
values for the unknown constants a, b, and c, substitute them in the
formula, work out the values of Y for various values of X, and plot them
just as we did before, we will see the sort of curve this equation can be
used to express. Thus if we take 1 for a, 0.5 for b, and -0.1 for c, the
equation will read


$Y = 1 + 0.5X - 0.1X^2$



Simple Curvilinear Regression 

When the value of X is 0, Y will be 1, obviously. When X is 1, Y will be
$Y = 1 + 0.5(1) - 0.1(1^2) = 1.4$
Performing this operation for other values, we obtain

X value    Y value
   0         1.0
   1         1.4
   2         1.6
   3         1.6
   4         1.4
   5         1.0
   6         0.4

Plotting each of these values on cross-section paper and drawing a
smooth curve through the several points, we have the curve shown in
Figure 6.1, center top section. This discloses one characteristic of this
type of curve: the curve is always symmetrical on both sides of the highest
point, the point where it stops going up and starts to turn down (as
half way between X = 2 and X = 3 in this case). The value of Y when
X = 2 is the same as when X = 3. When X = 1 it is the same as when
X = 4 and, for X = 5, Y is the same as when X = 0. The curve could be
cut into halves at the point of turning downward, one of which would be
the reverse of the other. Besides this characteristic symmetry, this curve
has another peculiarity: it has one, and only one, change from moving
upward to moving downward, no matter what values are assigned to a,
b, and c, or how far it is carried out. For the equation shown, the curve
reaches its highest point when X = 2.5. As shown in Figure 6.1, the curve
continues downward on both sides of this point, no matter how large the
positive or negative values of X become. Thus if X = 100, Y = -949,
or if X = -100, Y = -1,049.
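
The tabulated values, the symmetry about the turning point, and the two
extreme values quoted above can all be reproduced with a short script.
This is an illustrative sketch only; the function name `parabola` is
invented here, and the location of the turning point uses the standard
result X = -b/(2c) for a parabola, which the text states numerically
but does not derive.

```python
def parabola(x, a=1.0, b=0.5, c=-0.1):
    """Y = a + bX + cX^2, defaulting to the constants used in the text."""
    return a + b * x + c * x ** 2


# Reproduce the table of X and Y values for X = 0, 1, ..., 6.
table = [(x, round(parabola(x), 1)) for x in range(7)]
for x, y in table:
    print(f"{x:>3}  {y:>5.1f}")

# Turning point of a parabola: X = -b / (2c).
turning_point = -0.5 / (2 * -0.1)

# Symmetry: points equally far on either side of the turning point agree.
symmetric = all(
    abs(parabola(turning_point - d) - parabola(turning_point + d)) < 1e-9
    for d in (0.5, 1.5, 2.5)
)
```

Evaluating `parabola(100)` and `parabola(-100)` gives the two large
negative values mentioned in the text, showing that the curve keeps
falling on both sides of the turning point.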

If the value of b were negative and c were positive, the curve would then
be concave from above instead of convex and would be symmetrical
with respect to its lowest point.

Because of the characteristics mentioned, this type of curve is not very 
satisfactory to represent many types of relations. It does have great 
flexibility, in that many differently shaped curves can be represented by 
some particular segment of the parabola; but on the other hand the 
parabolic shape itself is so simple that many times the real relation between 
the variables cannot be represented by it. 

The characteristics of a number of other types of simple curves are
also illustrated in Figure 6.1. In each case an equation of the type indicated
has been assumed, and the values of Y corresponding to values of X
have been computed as has just been done for the simple parabola. Then
plotting these computed values gives the curves shown. Thus type (f),
the cubic parabola, is seen to have one maximum point and one minimum
point and one point of inflection (the point where the curve changes from
concave from above to convex, or vice versa). No matter what values are
assigned the constants in this equation, it can have only the single inflection
and the two points of maxima and minima. Of course the particular data
to be represented might fall anywhere along the entire course of the curve;
if only a single change from positive to negative slope were required,
the point of inflection in the cubic parabola might lie beyond the extremes
of the data, and so not show at all when the fitted curve was plotted for
the range covered by the data.




Figure 6.1 also illustrates curves of types (b) to (e), as well as some others
not given special designations here. In each case where the log of Y is
used in place of Y, it is evident that the previous curve has been modified
as if by compressing the ordinates nearest zero and stretching out the
ordinates farthest away from zero, stretching them more and more as they
depart more and more from zero. This process transforms the straight
line of Y = a + bX to a curve concave from above when log Y = a + bX
is used; or, when $\log Y = a + bX + cX^2$ is substituted for
$Y = a + bX + cX^2$, it can sharpen the top of the bend if b is positive, or
the bottom of the dip if b is negative. Similar results are found with the
cubic parabola.

Similarly, when log X is used in place of X, the previous curves are 
modified as if the abscissas were compressed near zero, and stretched out 
in the higher values. This changes the straight line of Y = a + bX to 
a curve for Y = a + b log X, convex from above when b is positive and 
concave from above when b is negative. The parabolas are similarly 
transformed, making the slopes different on each side of the bend in the 
simple parabola or on each side of the inflection in the cubic. The effect 
is to move the first “hump” or “dip” in nearer to the zero abscissa and 
to stretch out the remainder of the curve (including the second bend, in the 
case of the cubic parabola). 

When logarithms are used for both X and Y, the effect is to modify
both sets of coordinates in the manner previously described. The curve
log Y = a + b log X may have either a concave or convex bend if b is
positive, but is always concave from above if b is negative. Similar
modifications are noted in the case of the simple parabola.
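
The remark about the sign of b can be checked by writing the double-
logarithmic equation in natural units. The short derivation below is a
sketch added for illustration; it assumes common (base-10) logarithms
and positive X, and the symbol A is introduced here only as shorthand
for $10^a$.

```latex
\begin{aligned}
\log Y &= a + b \log X \;\Longrightarrow\; Y = A X^{b}, \qquad A = 10^{a} > 0,\\
\frac{dY}{dX} &= A\,b\,X^{b-1}, \qquad
\frac{d^{2}Y}{dX^{2}} = A\,b\,(b-1)\,X^{b-2}.
\end{aligned}
```

For b < 0 the product b(b - 1) is positive, so the curve is concave from
above throughout. For positive b it is convex from above when
0 < b < 1, a straight line through the origin when b = 1, and concave
from above when b > 1.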

In any event it should be noted that the curves whose equations contain 
logarithms retain some of the same characteristics as those with similar 
equations without logarithms. Thus the linear equations (with only a 
and b) never change from a positive to a negative slope; the simple 
parabola always has one such change, if carried out far enough; and the 
cubic parabola always has two such changes. In addition, it should be 
noted that a variable can be stated in terms of logarithms only if it has 
no negative values. Whereas the other functions can express negative 
values as readily as positive ones, the logarithmic curves always become
asymptotic as they approach zero — that is, they tend to flatten out and 
to run almost parallel with the axis. This is because a logarithm cannot 
be obtained for a negative number. No matter how small a logarithm 
becomes, the corresponding antilogarithm is still positive, even if only 
a very small decimal fraction. 

The hyperbola [type (e)] shown just below the center of Figure 6.1
also is peculiar in that it can become asymptotic as it approaches both the
X axis and the Y axis, even if one or both of the variables are in negative
values.¹ However, the values of X and Y which it approaches are not the
zero values, as with the logarithmic curves, but special values which vary
in each particular case and depend upon the value of the constants a and b
in the equation. Still more complex curves of the same hyperbolic type
may be obtained by including higher powers of X, such as

$Y = \dfrac{1}{a + bX + cX^2}$


Still other curves may be represented by hybrid equations, which combine
two or more of the simple types described thus far. Thus type (g)
is a compound of a simple linear equation and a simple hyperbola. This
is sometimes useful to represent curves which cannot be represented by
the simpler equations. The choice of an equation to represent a particular
set of data, however, depends upon logical analysis as well as upon the
empirical ability of a given equation to represent the relation found.

The Logical Significance of Mathematical Functions. There has
been frequent reference previously to the question of whether an equation
did or did not express “the real nature” of a relationship. To know when
we would be justified in using a simple freehand curve, and when we should
go to the additional work of determining an equation for the curve,
we must understand the logical bases for different types of equations,
so that we can judge whether or not any particular type of curve can
logically be expected to express the relation in any given set of
observations.

The Linear Equation. Many relations are so simple that ordinarily
we would not think of expressing them mathematically. Thus, if a train
is traveling 45 miles an hour, the distance traveled is equal to the time
multiplied by the speed. Using t for the time in hours, d for distance,
and s for speed, the relation is obviously d = st.
This is a simple straight-line relation. Now, if, in addition, the train
were a miles away from a given station at the beginning, after t hours of
additional travel away from the station it would be D miles away, where
D = a + d = a + st

This is now expressed in the usual form for the straight-line equation,


¹ There are three types of simple hyperbolas which are frequently useful as curve
equations: $Y = a + b/X$ is an equilateral hyperbola, asymptotic to a line parallel
to the X axis; $Y = 1/(a + bX)$ is a hyperbola asymptotic to a line parallel to the
Y axis; and $Y = 1/(a + b/X)$ is an equilateral hyperbola asymptotic to lines
parallel to both axes.




Y = a + bX. This equation is therefore the one to be used when it can
logically be expected that each unit change in X is accompanied by a
corresponding change in Y, regardless of the size of X. Such a mathematical
expression of the logical relationship, which is expected to exist
in a problem under study, is called a “model” of the relationship.² Thus
in computing the distance the train has traveled we are assuming that it
continued to travel at a definite rate, say 45 miles an hour, the whole way, 
and traveled the two-hundredth mile just as fast as the first mile. But if 
we were dealing with something where the change in Y was not the same 
for different values of X, the linear model would no longer be satisfactory. 
For example, an airplane on a long-distance flight has to carry a heavy 
load of gasoline at the start and hence cannot attain full speed; the farther 
it goes the lighter its load becomes and the higher speed it can make. 
In such a case the straight-line formula would not be applicable, since 
the speed of the plane would increase with the distance it had gone. If 
the straight-line formula were used, it would indicate that it would take 
just as long to travel the first hundred miles as the last hundred, whereas 
actually it would take longer than that to travel the first hundred and less 
than that to travel the final hundred. Only a model which included some 
value that properly took into account the change in speed with the change 
in distance could satisfactorily represent this relation. 

The Quadratic Equation. Another familiar case in which the rate
at which Y increases changes as the value of X increases is that of a
weight falling to the ground. Since the attraction of the earth near the
earth is practically a constant, it exercises a constant pull on a falling
body. Thus, the farther a body falls, the faster it travels. It is just as if,
in throwing a ball, a boy did not let go the ball for it to travel by its
momentum but was able to keep shoving against it, adding more and
more speed to the momentum it already had. Physicists express this
relation by saying that the velocity with which an object falls is accelerated
at a constant rate. This equation, therefore, is:

V = gt 

where g is a constant measuring the force of gravity, V is velocity in feet 
per second, and t is time in seconds. 

The velocity, or speed, increases with every passing moment, and there- 
fore the distance traveled in each succeeding second is greater than the 
distance traveled in the previous second. 

² A complete model will also contain a term to recognize departures of individual
observations from the continuous line called for by the theory. For example, if accidental
errors in recording the train’s passage led to fluctuations in the actual distances from
those expected, this might be recognized by stating the model Y = a + bX + e, where e
stood for the accidental errors.




If we assume that the value of g in the equation is already known to be
32, the equation V = gt can then be written V = 32t.

We can estimate the distance traversed by a falling body in each successive
second by a process like this:

Let us figure that the average speed for each 2 seconds is the same as
at the midpoint, and then let us estimate the distance traversed in those
2 seconds by multiplying this average speed by the time. Then by adding
all the distances together we can get an approximation of the total distance.
First we need to calculate the average speed for each period, using the
last equation, V = 32t.

End of 1st second, speed = 32(1) = 32 = average speed for 1st two seconds 

End of 3d second, speed = 32(3) = 96 = average speed for 2d two seconds 

End of 5th second, speed = 32(5) = 160 = average speed for 3d two seconds 

End of 7th second, speed = 32(7) = 224 = average speed for 4th two seconds 

End of 9th second, speed = 32(9) = 288 = average speed for 5th two seconds 


Then we can estimate the distance traveled in each 2-second period,
as follows:

Period    Average Speed        Distance in That Period
          (feet per second)    (feet)
1st             32                    64
2d              96                   192
3d             160                   320
4th            224                   448
5th            288                   576

Estimated total distance           1,600


Since the velocity increases at a uniform rate for each moment of time,
the true average rate of speed for any period will be just half way between
the speed at the beginning and at the end.³

The average speed during any period of t seconds is therefore 32t/2.
The total distance traversed in the t seconds can therefore be determined
by multiplying the average speed, 32t/2, by the total number of seconds, t.
This gives $d = 32(t/2)t$, or $d = 32t^2/2 = 16t^2$.
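
The period-by-period estimate and the closed formula can be compared
directly. The sketch below is illustrative only; the names
`midpoint_distance` and `exact_distance` are invented here, and g = 32
feet per second per second is the value assumed in the text.

```python
G = 32.0  # acceleration assumed in the text, feet per second per second


def midpoint_distance(total_seconds, step=2):
    """Sum (speed at midpoint of each period) x (length of period)."""
    total = 0.0
    start = 0.0
    while start < total_seconds:
        midpoint = start + step / 2
        total += G * midpoint * step   # average speed times elapsed time
        start += step
    return total


def exact_distance(t):
    """Closed formula d = (g/2) t^2, i.e. 16 t^2 when g = 32."""
    return (G / 2) * t ** 2


estimate = midpoint_distance(10)   # five 2-second periods
formula = exact_distance(10)
print(estimate, formula)
```

Because the speed rises uniformly, the midpoint speed of each period is
its true average speed, so the period-by-period total agrees with the
formula whatever period length is chosen.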

³ This would not be true of all types of relations. If, for example, velocity increased
at a changing rate, the smaller the units taken the more accurate would be the result.

So far, we have assumed that we know the acceleration, or rate of
increase in velocity per second. Suppose instead we had not known it
to begin with. How could we have found it out?

If we had used the symbol g to represent this value, we could have carried




out all the previous calculations, except that we should have used “g”
where instead we have used “32.”

Our last formula then would have been

$d = g(t/2)t$, or $d = (g/2)t^2$

If we let g/2 = b, the equation then would read $d = bt^2$.

We could readily determine the value for b by observing the distance 
a given body falls in 1 second, in 2 seconds, in 3 seconds, etc., and then 
working out the probable value for the constant, just as has been done 
before. 
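
One way of working out the constant from such observations is sketched
below. It is an assumption of this sketch that b is chosen by least
squares for a curve with no constant term, which gives
b = Σ(d·t²) / Σ(t⁴); the function name `fit_b` and the timings are
hypothetical and are not measurements from the text.

```python
def fit_b(times, distances):
    """Least-squares b for d = b * t**2 (no constant term)."""
    numerator = sum(d * t ** 2 for t, d in zip(times, distances))
    denominator = sum(t ** 4 for t in times)
    return numerator / denominator


# Hypothetical observations: feet fallen after 1, 2, 3 and 4 seconds,
# with small invented errors of measurement.
times = [1, 2, 3, 4]
distances = [16.5, 63.0, 145.0, 256.5]

b = fit_b(times, distances)
g = 2 * b   # since b stands for g/2
print(f"b = {b:.2f}, implied g = {g:.2f}")
```

With error-free observations such as 16, 64 and 144 feet at 1, 2 and 3
seconds, the same function returns b = 16 exactly.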

In this case it should be noted that the formula $d = bt^2$ is derived
on the assumption that the attraction of gravity is a constant, tending 
to increase velocity at a uniform rate per second, or other unit of time. 
Only if this assumption is correct can the equation be used. The equation 
is directly based upon this assumption; the reasoning used in deriving 
the equation also serves to explain what the constants obtained really 
represent. On the basis of this reasoning the equation determined is not a 
mere empirical expression of the relation between time falling and distance 
traversed. Instead, it is a fundamental measurement of why that distance 
is what it is, and relates it in a logical manner to the attraction of the earth. 

Although it would be quite possible in this particular case to draw 
a freehand curve expressing the relation between time and distance, it 
would not be so satisfactory as the mathematical equation. The curve 
would merely state what the relation was; the equation, in addition, 
explains why it is, in the terms of a particular hypothesis. 

The Parabolic Equation. Another physical case in which a definite
relationship may be established logically, and then measured statistically, 
is the firing of a projectile from a gun. 

Disregarding the resistance of the air, there are three elements which 
will determine the height the projectile will have reached at any given 
instant after it leaves the muzzle of the gun. The simplest of these elements 
is the height of the muzzle of the gun itself, represented by a in Figure 6.2. 
All the subsequent changes in elevation will obviously have to be added 
to that. 

The second element is the rate at which the projectile is moving upward 
at the instant it leaves the muzzle. That is dependent, of course, on the 
angle at which the gun is elevated and on the muzzle velocity. If the slope 
were 10 per cent from the horizontal and the muzzle velocity were 1,000 
feet per second, the projectile would leave the muzzle moving upward 
at the rate of 100 feet per second. If there were no resistance of the air, 
and if there were no force of gravity to pull the projectile off its course, 
its momentum would carry it on in this direction to infinity, as illustrated 




by the straight line in the picture. (These directions are, of course, relative
to the earth’s surface at the moment of firing.) Here b represents the
increase in elevation the projectile would attain for each additional second
of flight, and a + bt the elevation it would attain if gravity did not
influence it.

But gravity is at work too. As we have already seen, as soon as a body
is released, the pull of gravity tends to move it downward at ever-increasing
speed. Even if it is headed upward as when shot from a gun, the pull of
gravity starts tending to pull it down. The diagram illustrates what happens,
with C used to represent the distance the body would have fallen if it
had no upward velocity. At first the gain in height from its upward
momentum is more than enough to offset the tendency to lose height
because of the pull of gravity, and the projectile moves upward along the
curved course indicated. But finally the loss due to gravity becomes
greater than the gain from its original upward momentum and the
trajectory gradually turns downward, until the projectile finally comes to
rest in the earth or on its target.

Fig. 6.2. The trajectory of a projectile, illustrating the equation
$Y = a + bX + cX^2$.

The height that the projectile reaches at any moment is the sum of these
three components: the original height, the upward course, and the
loss by gravity. Its height, then, can be expressed by adding together the
three elements:

1. a remains the same, regardless of the time elapsed.

2. B, the height due solely to the original momentum, depends on the
time, increasing as the time increases. If we let b represent the initial rate
of gain in elevation per second of time, B can then be stated B = bt.

3. Finally, C depends on the time elapsed, and, as we have just seen,
varies with the square of time. With the same notation as in our falling-
stone problem, but with C substituted for distance fallen, $C = -(g/2)t^2$.



Adding these three elements together, we obtain the equation for the
height of the projectile at any instant, letting H represent height in feet:
$H = a + bt + ct^2$, where c stands for $-(g/2)$.

It will be seen that this equation is exactly identical in form with the
equation for a parabola, $Y = a + bX + cX^2$.

Measurements of the height of the projectile at various given times after 
firing the charge, made for a given gun, firing the same charge at the same 
elevation of the gun, would give a series of X and Y values which could 
be used in computing the constants a, b, and c, even if all were unknown 
to start with. 
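
A minimal sketch of such a computation follows. It assumes that the
three constants are found by least squares, solving the three normal
equations by elimination; the function names and the "observations" are
hypothetical, the heights being generated from a = 5, b = 100 and
c = -16 purely so that the result can be checked.

```python
def solve_linear(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * solution[k] for k in range(row + 1, n))
        solution[row] = (aug[row][n] - tail) / aug[row][row]
    return solution


def fit_parabola(ts, hs):
    """Least-squares a, b, c for H = a + b*t + c*t**2."""
    powers = [sum(t ** p for t in ts) for p in range(5)]   # sums of t^0 .. t^4
    moments = [sum(h * t ** p for t, h in zip(ts, hs)) for p in range(3)]
    normal = [[powers[i + j] for j in range(3)] for i in range(3)]
    return solve_linear(normal, moments)


# Hypothetical heights (feet) at whole seconds after firing.
ts = [0, 1, 2, 3, 4, 5, 6]
hs = [5 + 100 * t - 16 * t ** 2 for t in ts]
a, b, c = fit_parabola(ts, hs)
print(f"a = {a:.2f}, b = {b:.2f}, c = {c:.2f}, implied g = {-2 * c:.2f}")
```

With real measurements the fitted constants would not be exact, but each
would still carry its physical meaning: muzzle height, initial rate of
climb, and half the acceleration of gravity with sign reversed.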

If the equation were actually worked out, it would tell much more than 
merely the graph of the relation. For if the reasoning on which the several 
different constants were included in the equation was correct, then the 
equation would furnish a real explanation of why the projectile moved as 
it did, in terms of the laws of motion and of gravity upon which all such 
movements depend. 

If the projectile were an intercontinental ballistic missile, the equation 
expressing its flight would be much more complicated, because it would 
also have to take account of the curvature of the earth over long distances, 
of the continued acceleration until its fuel had been expended, of the 
varying resistance offered by the air as it became thinner at higher altitudes, 
and possibly even of the varying pull of gravity as the missile rose to very 
high altitudes. And if the missile were of the type that can change its 
course — that is, steer itself — until its fuel was exhausted, the calculations 
would have to consider also the elevation and direction needed at that 
moment, and during its trajectory thereafter, in order to reach its target.

By a similar process of logical analysis, we can work out the type of 
equation that should properly fit our auto-stopping example, i.e., serve 
as the model for the relation. We know that when a car is not moving, 
it takes 0 distance to stop. Therefore, when X (speed) = 0, Y (distance) 
should also equal 0. So our curve should have no a constant, in view of 
this logical condition. 

Second, it takes some definite period of time after the signal to stop is 
given for a driver to react, to put his foot on the brake pedal, and for the 
brakes to start to take hold. During that period of time, the car will 
continue to move at its original speed — and the higher that speed, the 
greater will be the distance covered before the brakes are applied. Our 
equation will therefore need a term of bX, to represent this part of the 
stopping distance. 

Third, once the brakes are applied, they will tend to exert a stopping 
force which is the product of the friction of the brakes multiplied by the 
movement of the wheel drums against that friction — and as the wheels 




slow down, that movement will constantly become less. We may assume
that this force of deceleration works in the exact reverse of the way the
accelerating force of gravity works, and that the speed of the car falls
in a straight line with time, just as the speed of falling increased in a
straight line. So the sum of this force, as measured by the distance it
takes to stop after the brakes begin to take hold, should vary with the
square of the initial speed, just as in the gravity case the distance fallen
varied with the square of the time. So we will need a term $cX^2$ to
represent this portion of the stopping distance.

On the basis of this analysis, we arrive at the equation

$Y = bX + cX^2$

as the model which, fitted to the observations of the auto-stopping example,
should both fit the observed relation, and have a logical explanation of its
two constants, b and c.
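
A model of this form, with no constant term, can be fitted from two
normal equations. The sketch below assumes a least-squares fit; the
function name `fit_stopping_model` is invented here, and the speeds and
distances are hypothetical figures generated from b = 1.0 and c = 0.05,
not the observations of the auto-stopping example.

```python
def fit_stopping_model(speeds, distances):
    """Least-squares b and c for Y = b*X + c*X**2 (no constant term)."""
    s2 = sum(x ** 2 for x in speeds)
    s3 = sum(x ** 3 for x in speeds)
    s4 = sum(x ** 4 for x in speeds)
    sxy = sum(x * y for x, y in zip(speeds, distances))
    sx2y = sum(x ** 2 * y for x, y in zip(speeds, distances))
    # Normal equations:  b*s2 + c*s3 = sxy   and   b*s3 + c*s4 = sx2y
    det = s2 * s4 - s3 ** 2
    b = (sxy * s4 - sx2y * s3) / det
    c = (s2 * sx2y - s3 * sxy) / det
    return b, c


# Hypothetical speeds (miles per hour) and stopping distances (feet).
speeds = [10, 20, 30, 40]
distances = [15, 40, 75, 120]
b, c = fit_stopping_model(speeds, distances)
print(f"Y = {b:.2f} X + {c:.3f} X^2")
```

Leaving out the constant term builds in the logical condition that a
car standing still needs no distance to stop, so the fitted curve passes
through the origin by construction.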

Reasoning such as this, carried out to much greater lengths, has formed
the basis for the scientific “laws” which have been discovered in physics
and chemistry and expressed in definite equations. Methods for determining
the constants have been devised to serve in determining such
relations. But when the same methods are applied to biological, economic,
educational, or other relationships in the natural or social sciences,
their value is much more limited. Only rarely is there real basis for
expecting a particular mathematical relationship such as can be expressed
in a given type of equation. In many cases our knowledge of the reasons
for the relationship is altogether too limited to enable us to say why it
exists, and even where we can establish the reasons, they are frequently
too complicated or too involved to admit of mathematical treatment. If
we express a given relation by a formula, merely on the basis that that
formula seems to describe the observed relation satisfactorily, we do not
have any greater knowledge of the relation than if we merely drew in
a freehand curve. The equation is simply an empirical description of
the relation; of and by itself, it offers no clue as to what the relation
means.


Practical Procedures for Fitting Curves 

The equations discussed to this point can all be fitted to the data by
relatively elementary arithmetic operations, as will be shown subsequently.
There are many other types of more complicated equations which cannot
be fitted so readily. These can reproduce curves with recurrent or periodic
oscillations, growth curves, and other complicated biological or physical
phenomena. Discussion of the use and fitting of such complicated curves
lies outside the scope of this book.⁴

The inability of any one equation to represent many simple curves may
be illustrated by taking a new example. Table 6.1 shows a series of
observations of two variables: the protein content of different samples
of wheat, as determined by chemical analysis, and the proportion of
“hard, dark, vitreous kernels” in each sample, as determined by visual
examination with the naked eye. The relation here is quite different from
the one we have been considering so far. There is no causal connection
between these two variables in the sense of one’s being caused by the
other. Instead, they are merely two different ways of measuring the
character of the wheat. It is a short, rapid process, however, to examine
the samples by eye and determine the percentage of hard, dark, vitreous
kernels, whereas it is a long and expensive process to run a chemical test
on each lot. For that reason it is of importance to know whether it is
possible to estimate the protein content from the percentage of vitreous
kernels, and, if so, how closely. So even though the vitreous kernels do
not cause the differences in protein, we can still regard the proportion of
vitreous kernels as the independent variable and the percentage of protein
as the dependent variable. That means only that we are going to try to
estimate the dependent (protein) from the independent (percentage
of vitreous kernels) even though there is no direct cause and effect relation
present.

Table 6.1

Protein Content and Proportion of Vitreous Kernels for Each
of a Number of Samples of Wheat*

Sample Number    Protein Content    Proportion of Vitreous Kernels
                    Per cent                  Per cent
      1               10.3                        6
      2               12.2                       75
      3               14.5                       87
      4               11.1                       55
      5               10.9                       34
      6               18.1                       98
      7               14.0                       91
      8               10.8                       45
      9               11.4                       51
     10               11.0                       17
     11               10.2                       36
     12               17.0                       97
     13               13.8                       74
     14               10.1                       24
     15               14.4                       85
     16               15.8                       96
     17               15.6                       92
     18               15.0                       94
     19               13.3                       84
     20               19.0                       99

* These values are selected cases, picked so as to show the relationship more clearly.
Actually, the correlation is not so high as is shown here.

Fig. 6.3. Dot chart showing relation of proportion of vitreous kernels to
protein content of wheat.

⁴ For examples of such complicated curves and methods of fitting them, see Frederick
E. Croxton and Dudley J. Cowden, Applied General Statistics, 2d ed., pp. 297 3l«,
Prentice-Hall, Englewood Cliffs, N.J., 1955. Also pp. 540-571, 1st ed., Henry Holt
and Co., N.Y., 1939.

The relation between the proportion of vitreous kernels and the per
cent of protein may be seen more readily if a dot chart is made, showing
the two variables for each of these individual observations. We shall
designate the proportion of kernels vitreous as X, and the percentage of
protein as Y. In preparing the dot chart, shown in Figure 6.3, we shall
therefore plot the X values, or percentage of vitreous kernels, along the
horizontal axis and the Y values, the proportion of protein, along the
vertical axis.

It is quite obvious from the figure that a straight line could not represent 
the change in protein with change in vitreous kernels. Some type of curve 
is necessary. In this case we have no prior logical expectation as to what 
type of curve would fit. Let us see if the simple parabola is the proper 
type of curve. 

Fitting a Simple Parabola. To represent the relationship between
the two variables according to the formula

$$Y = a + bX + cX^2 \qquad (6.1)$$

we shall have to determine from the 20 observations the values to assign
to the constants a, b, and c, just as before for the straight line we had to
determine values for a and b. (Of course the a and b for the parabola
will not be the same as the values would be for a straight line fitted to
the same data, unless c happens to be zero, which would make the equation
for the parabola give a straight line instead.) The values for these
constants are determined by constructing and solving the following
equations:⁵

$$\begin{aligned}
(\Sigma x^2)b + (\Sigma xu)c &= \Sigma xy \\
(\Sigma xu)b + (\Sigma u^2)c &= \Sigma uy
\end{aligned} \qquad (6.2)$$

$$a = M_y - b(M_x) - c(M_u) \qquad (6.3)$$

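The values necessary in constructing equations (6.2) and (6.3) are
derived as follows:

Use U to represent the $X^2$ values of equation (6.1).⁶ Then

$$\begin{aligned}
M_x &= \frac{\Sigma X}{n}, \qquad M_u = \frac{\Sigma U}{n}, \qquad M_y = \frac{\Sigma Y}{n} \\
\Sigma x^2 &= \Sigma X^2 - nM_x^2 \\
\Sigma xu &= \Sigma XU - nM_xM_u \\
\Sigma u^2 &= \Sigma U^2 - nM_u^2 \\
\Sigma xy &= \Sigma XY - nM_xM_y \\
\Sigma uy &= \Sigma UY - nM_uM_y
\end{aligned} \qquad (6.4)$$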


⁵ An alternative method is to solve the following three equations simultaneously.
The clerical work is about the same in both methods.

$$\begin{aligned}
na + (\Sigma X)b + (\Sigma U)c &= \Sigma Y \\
(\Sigma X)a + (\Sigma X^2)b + (\Sigma UX)c &= \Sigma XY \\
(\Sigma U)a + (\Sigma UX)b + (\Sigma U^2)c &= \Sigma UY
\end{aligned}$$

⁶ If U is made equal to $X^2$ divided by some convenient number, say 1,000, the volume
of necessary arithmetic can be materially reduced, without affecting the accuracy of the
result.




After computing these values, the two equations (6.2) are solved
simultaneously to obtain the values for b and c, and then these values are
substituted in equation (6.3) to obtain the value for a.

Table 6.2 shows the form of computation in the first step to obtain these
values for the data of Table 6.1 (the computations are omitted for the
central 16 observations).


Table 6.2

Computation for Wheat Problem of Values Needed to
Determine Constants of the Simple Parabola

Vitreous    Protein
Kernels,  (minus 10),*   X² and U        XU            U²          XY          UY
   X          Y
Per cent   Per cent
    6         0.3            36           216         1,296        1.8        10.8
   75         2.2         5,625       421,875    31,640,625      165.0    12,375.0
  ...         ...           ...           ...           ...        ...         ...
   84         3.3         7,056       592,704    49,787,136      277.2    23,284.8
   99         9.0         9,801       970,299    96,059,601      891.0    88,209.0

Sums
1,340        68.5       107,566     9,259,238   824,403,226    5,985.6   542,584.6

* To simplify the following calculations, 10.0 has been subtracted from each protein
reading.

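The values at the foot of the table give the values called for in equations
(6.4). Substituting the values as computed for those shown symbolically,
the arithmetic appears as follows:

$$\begin{aligned}
M_x &= \frac{\Sigma X}{n} = \frac{1{,}340}{20} = 67 \\
M_y &= \frac{\Sigma Y}{n} = \frac{68.5}{20} = 3.425 \\
M_u &= \frac{\Sigma U}{n} = \frac{107{,}566}{20} = 5{,}378.3 \\
\Sigma X^2 - nM_x^2 &= 107{,}566 - 20(67)^2 = 17{,}786 \\
\Sigma XU - nM_xM_u &= 9{,}259{,}238 - 20(67)(5{,}378.3) = 2{,}052{,}316 \\
\Sigma U^2 - nM_u^2 &= 824{,}403{,}226 - 20(5{,}378.3)^2 = 245{,}881{,}008 \\
\Sigma XY - nM_xM_y &= 5{,}985.6 - 20(67)(3.425) = 1{,}396.1 \\
\Sigma UY - nM_uM_y &= 542{,}584.6 - 20(5{,}378.3)(3.425) = 174{,}171.05
\end{aligned}$$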






These calculations give the values needed in equations (6.2) which are
to be solved simultaneously to obtain the values of b and c. Substituting
the values just computed in the equations gives the two equations to be
solved as follows:

$$\begin{aligned}
\text{(A)}\quad (\Sigma x^2)b + (\Sigma xu)c = \Sigma xy: &\qquad 17{,}786b + 2{,}052{,}316c = 1{,}396.1 \\
\text{(B)}\quad (\Sigma xu)b + (\Sigma u^2)c = \Sigma uy: &\qquad 2{,}052{,}316b + 245{,}881{,}008c = 174{,}171.05
\end{aligned}$$

The simplest way to solve these is by the Doolittle method, as indicated
in Appendix 2, pp. 489 to 497.

Solving the equations simultaneously gives b = −0.0879, c = 0.001442.
These values are then substituted in equation (6.3) to obtain the value
for a.

$$\begin{aligned}
a &= M_y - b(M_x) - c(M_u) \\
  &= 3.425 - (-0.0879)(67) - (0.001442)(5{,}378.3) \\
  &= +1.56
\end{aligned}$$

With our values for a, b, and c, we can now write out the equation for
the parabola, $Y = a + bX + cX^2$, for this particular case as follows:

$$Y = 1.56 - 0.088X + 0.00144X^2$$

Since 10 was subtracted from the percentage of protein before calculating
the equation,⁷ to estimate the actual percentage 10 must be added back
in, making the equation read

$$\text{(I)}\qquad Y = 11.56 - 0.088X + 0.00144X^2$$

This then is the equation of the simple parabola which comes nearest 
to describing the relationships between Y and X. From it the percentage 
of protein in a given sample of wheat may be estimated from the percentage 
of hard, dark, vitreous kernels in that sample. 
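The arithmetic of equations (6.2) to (6.4) can be sketched in a few lines of code. The block below is an illustration only, not part of the original procedure: the function name `fit_parabola` and the use of Cramer's rule in place of the Doolittle method are assumptions made for brevity. It works on the original protein percentages of Table 6.1, so the constant a comes out already on the scale of equation (I).

```python
# Wheat data from Table 6.1: X = per cent vitreous kernels, protein in per cent.
X = [6, 75, 87, 55, 34, 98, 91, 45, 51, 17, 36, 97, 74, 24, 85, 96, 92, 94, 84, 99]
protein = [10.3, 12.2, 14.5, 11.1, 10.9, 18.1, 14.0, 10.8, 11.4, 11.0,
           10.2, 17.0, 13.8, 10.1, 14.4, 15.8, 15.6, 15.0, 13.3, 19.0]


def fit_parabola(xs, ys):
    """Fit Y = a + bX + cX^2 from the deviation sums of equations (6.4)."""
    n = len(xs)
    us = [x * x for x in xs]                       # U stands for X squared
    mx, mu, my = sum(xs) / n, sum(us) / n, sum(ys) / n
    sxx = sum(x * x for x in xs) - n * mx * mx
    sxu = sum(x * u for x, u in zip(xs, us)) - n * mx * mu
    suu = sum(u * u for u in us) - n * mu * mu
    sxy = sum(x * y for x, y in zip(xs, ys)) - n * mx * my
    suy = sum(u * y for u, y in zip(us, ys)) - n * mu * my
    # Equations (6.2), solved here by Cramer's rule (an assumption of this
    # sketch; the text recommends the Doolittle method for hand work).
    det = sxx * suu - sxu * sxu
    b = (sxy * suu - sxu * suy) / det
    c = (sxx * suy - sxu * sxy) / det
    a = my - b * mx - c * mu                       # equation (6.3)
    return a, b, c


a, b, c = fit_parabola(X, protein)
print(round(a, 2), round(b, 4), round(c, 6))
```

Because subtracting 10 from every protein reading changes only the mean of Y, fitting the unshifted readings should give the same b and c and an a larger by 10.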

We can see how the estimates are made by working them out for some 
of the samples. If we take the value of X for the first sample in Table 6.2, 
and substitute it in equation (I) we obtain an estimated value for Y as 
follows: 

When X = 6 

Y= 11.56 - 0.088(6) + 0.00144(36) = 11.08 

Substituting each of the values of X in the formula in turn in a similar
manner, we obtain estimated values for Y as shown in Table 6.3. So as
to distinguish between the actual values of Y, and the values for Y
estimated from X according to the equation of the parabola, we shall
designate the latter as Y' values.

⁷ This does not affect the values obtained for $\Sigma(x^2)$, $\Sigma(xy)$, etc.


Table 6.3

Comparison, for Wheat Problem, of Actual Protein Content
with Protein Content Estimated from Per Cent of Vitreous
Kernels on Basis of the Simple Parabola

Vitreous     Protein       Estimated Protein    Difference Between Actual
Kernels,    (minus 10),       (minus 10),        and Estimated Protein,
   X            Y                 Y'                    Y − Y'
Per cent     Per cent          Per cent
    6          0.3               1.08                   −0.78
   75          2.2               3.06                   −0.86
   87          4.5               4.80                   −0.30
   55          1.1               1.08                   +0.02
   34          0.9               0.23                   +0.67
   98          8.1               6.79                   +1.31
   91          4.0               5.50                   −1.50
   45          0.8               0.52                   +0.28
   51          1.4               0.83                   +0.57
   17          1.0               0.48                   +0.52
   36          0.2               0.26                   −0.06
   97          7.0               6.60                   +0.40
   74          3.8               2.95                   +0.85
   24          0.1               0.28                   −0.18
   85          4.4               4.51                   −0.11
   96          5.8               6.41                   −0.61
   92          5.6               5.68                   −0.08
   94          5.0               6.04                   −1.04
   84          3.3               4.35                   −1.05
   99          9.0               6.99                   +2.01


We can plot the actual and the estimated values on a dot chart (Figure
6.4), using dots to represent the values of Y originally observed and crosses
to represent the estimated values, Y'. The crosses all lie on a continuous
smooth curve, which we can sketch in freehand, as indicated by the dotted
line in the figure. To estimate the protein for a sample with a proportion
of vitreous kernels not included in our problem, say 65, we can substitute
65 for X in equation (I), and compute it out, or read from our smooth
curve the Y value corresponding to an X value of 65. This graphic
interpolation will not be quite exact, but for many purposes it will be
sufficient.
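Substituting a value of X in equation (I) is a single line of arithmetic, as the following sketch shows. The function name `protein_estimate` is a label chosen here for illustration; the coefficients are the rounded ones printed in equation (I), so results can differ in the second decimal from the entries of Table 6.3, which appear to rest on unrounded constants.

```python
def protein_estimate(x):
    """Equation (I) of the text: Y' = 11.56 - 0.088X + 0.00144X^2."""
    return 11.56 - 0.088 * x + 0.00144 * x * x


first_sample = protein_estimate(6)    # the worked example in the text
interpolated = protein_estimate(65)   # a proportion not among the 20 samples

print(round(first_sample, 2), round(interpolated, 2))
```

The computed value for X = 65 is the exact counterpart of the graphic interpolation described above.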

Let us now examine Figure 6.4 and decide whether the formula for the
parabola gives a satisfactory “fit” in this case, that is, whether the
estimated values do agree fairly well with the actual. The curved line does
come closer to agreeing with the actual values than any straight line could,
but the shape of the parabolic curve and the general trend of the actual
relationship are rather different.



Fig. 6.4. Dot chart showing relation of vitreous kernels to protein content of
wheat, and parabolic curve fitted to same.


Apparently the equation of the simple parabola is not adequate to 
describe this particular relationship. Especially for high proportions of 
vitreous kernels, the estimates are quite inaccurate. Between 70 and 90 
per cent vitreous kernels, the estimates of protein are all too high, with 
only one exception. For 99 per cent vitreous, the parabola would estimate 
17.0 per cent protein, whereas both samples over 97 per cent vitreous 
kernels had over 18 per cent protein. The failure of this curve to give 
a satisfactory “fit” is not due to any error in the computations but merely 
to the fact that this formula cannot give the proper-shaped curve to fit 
the relationship in this case. The mathematical properties of the equation 
itself are such that, no matter what constants are used for a, b, and c, 
it cannot come any closer to describing the true relation. 

Fitting a Cubic Parabola. The cubic parabola, type (f) of the
equations on page 70, might be tried to see if it would describe this
particular relationship more closely.

The equation of the cubic parabola,

$$Y = a + bX + cX^2 + dX^3 \qquad (6.5)$$

has four constants, a, b, c, and d, to be computed. Here again, of course,
a, b, and c will be different from those we have computed previously,
unless the d value comes out zero. The values b, c, and d are computed
by the simultaneous solution of the following three equations.⁸

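Use U to represent the $X^2$ of equation (6.5) and V to represent the $X^3$.

$$\begin{aligned}
(\Sigma x^2)b + (\Sigma xu)c + (\Sigma xv)d &= \Sigma xy \\
(\Sigma xu)b + (\Sigma u^2)c + (\Sigma uv)d &= \Sigma uy \\
(\Sigma xv)b + (\Sigma uv)c + (\Sigma v^2)d &= \Sigma vy
\end{aligned} \qquad (6.6)$$

The value for a is then computed from the following equation:

$$a = M_y - b(M_x) - c(M_u) - d(M_v) \qquad (6.7)$$

The values for $\Sigma x^2$, $\Sigma xu$, $\Sigma xy$, $\Sigma u^2$, and $\Sigma uy$ are computed as shown
previously, equations (6.4). The additional values required in equation
(6.6) are computed as follows:

$$\begin{aligned}
M_v &= \frac{\Sigma V}{n} \\
\Sigma xv &= \Sigma XV - nM_xM_v \\
\Sigma uv &= \Sigma UV - nM_uM_v \\
\Sigma v^2 &= \Sigma V^2 - nM_v^2 \\
\Sigma vy &= \Sigma VY - nM_vM_y
\end{aligned} \qquad (6.8)$$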

It should be noted that among the values required to “fit” this cubic
parabola, that is, to determine the constants a, b, c, and d, are such
values as $\Sigma V^2$ and $\Sigma UV$. Remembering that $V = X^3$, and $U = X^2$,
we need to calculate $X^5$ and $X^6$. For X = 10, $X^6$ = 1,000,000, so for
values of X such as those in Table 6.1, ranging from 6 to 99, it would
take a tremendous volume of computation to compute the values required
in equations (6.6), (6.7), and (6.8). This may be reduced by letting
$U = X^2/100$, and $V = X^3/10{,}000$. The computation is not shown here in
detail. It follows the general form of that given in Table 6.2, and the
solution of the equations (6.6), starting in just as shown on pages 178
to 180, may be most conveniently carried through by the methods shown
subsequently in Appendix 2.

⁸ The alternative method here involves the simultaneous solution of 4 equations
as follows:

$$\begin{aligned}
na + (\Sigma X)b + (\Sigma U)c + (\Sigma V)d &= \Sigma Y \\
(\Sigma X)a + (\Sigma X^2)b + (\Sigma XU)c + (\Sigma XV)d &= \Sigma XY \\
(\Sigma U)a + (\Sigma UX)b + (\Sigma U^2)c + (\Sigma UV)d &= \Sigma UY \\
(\Sigma V)a + (\Sigma VX)b + (\Sigma VU)c + (\Sigma V^2)d &= \Sigma VY
\end{aligned}$$

Even when the cubic parabola is “fitted” to the data given, however,
it does not give a satisfactory “fit.” Thus Figure 6.5 shows the cubic
parabola fitted to the data, worked out as just described. The values found
gave the equation

$$Y = 0.35 + 0.0345X - 0.1397(X^2/100) + 0.1788(X^3/10{,}000)$$

or, clearing of fractions,

$$Y = 0.35 + 0.0345X - 0.0014X^2 + 0.000018X^3$$

Adding in the 10 which was subtracted from Y before making the
computations, the equation becomes

$$Y = 10.35 + 0.0345X - 0.0014X^2 + 0.000018X^3$$

Fig. 6.5. Dot chart, with parabola and cubic parabola.

In Figure 6.5, the original observations are represented by dots, the 
estimated values from the cubic parabola are represented by crosses, and 
the curve of the simple parabola is also shown. A curve has been drawn 
through the crosses to show the general shape of the cubic parabola. 

The last curve comes much closer than the previous curve to describing 
the relationship which actually exists. Even so, however, it is not entirely 
satisfactory, for it gives estimates which are still too low at the very highest 
percentage of vitreous kernels. Except for this portion, and the slight 
downturn between values of 20 and 40, it seems quite satisfactory. 
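For readers working by machine rather than by hand, the fitting of the cubic parabola through equations (6.6) to (6.8) can be sketched as follows. This is an illustrative sketch only: the helper names `dev_sum` and `solve` are invented here, and ordinary Gaussian elimination stands in for the Doolittle layout of Appendix 2. The coding $U = X^2/100$ and $V = X^3/10{,}000$ is taken from the text; the constants obtained should land close to those quoted above, though the later digits may differ from a hand computation.

```python
X = [6, 75, 87, 55, 34, 98, 91, 45, 51, 17, 36, 97, 74, 24, 85, 96, 92, 94, 84, 99]
protein = [10.3, 12.2, 14.5, 11.1, 10.9, 18.1, 14.0, 10.8, 11.4, 11.0,
           10.2, 17.0, 13.8, 10.1, 14.4, 15.8, 15.6, 15.0, 13.3, 19.0]
Y = [p - 10 for p in protein]          # protein minus 10, as in Table 6.2
U = [x * x / 100 for x in X]           # coded square of X
V = [x ** 3 / 10000 for x in X]        # coded cube of X
n = len(X)


def dev_sum(ps, qs):
    """Sum of products of deviations from the means, as in (6.4) and (6.8)."""
    return sum(p * q for p, q in zip(ps, qs)) - sum(ps) * sum(qs) / len(ps)


def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    size = len(rhs)
    m = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, size):
            factor = m[r][col] / m[col][col]
            for k in range(col, size + 1):
                m[r][k] -= factor * m[col][k]
    sol = [0.0] * size
    for i in range(size - 1, -1, -1):
        tail = sum(m[i][k] * sol[k] for k in range(i + 1, size))
        sol[i] = (m[i][size] - tail) / m[i][i]
    return sol


columns = [X, U, V]
lhs = [[dev_sum(p, q) for q in columns] for p in columns]    # equations (6.6)
rhs = [dev_sum(p, Y) for p in columns]
b, c, d = solve(lhs, rhs)
a = (sum(Y) - b * sum(X) - c * sum(U) - d * sum(V)) / n      # equation (6.7)

fitted = [a + b * x + c * u + d * v for x, u, v in zip(X, U, V)]
residuals = [y - f for y, f in zip(Y, fitted)]
print(a, b, c, d)
```

The constants c and d here apply to the coded U and V; dividing them by 100 and 10,000 respectively clears the fractions, as is done in the text.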

There are still other types of curves, however, some of which might
give better fits than the ones we have tried. For instance the fourth-order
parabola,

$$Y = a + bX + cX^2 + dX^3 + eX^4$$

can be fitted by an extension of the methods just described, as can
parabolas with even more terms. Those are rarely useful, however, as the
greater the number of terms, the greater the tendency becomes for the
curve to “wiggle.” In addition, the volume of arithmetic required becomes
extremely burdensome, the computations for the fourth-order parabolas
involving powers of X up to $X^8$.

Furthermore, there are only a limited number of observations, 20 in
all. If a parabola were fitted with 20 constants, for example, it would
simply twist and turn so as to pass through every observation. Since it
would simply reproduce these 20 observations, it would be of no value at
all in indicating the relation which probably holds true in the universe
from which the observations in the sample are drawn. (See Chapters 18
and 22 for standard errors of the coefficients of a fitted curve, which indicate
its sampling significance, and provide a basis for calculating its confidence
limits.)
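The point that a curve with as many constants as observations merely reproduces the observations can be seen on a small scale. The sketch below is an illustration under stated assumptions: it takes only five of the wheat samples of Table 6.1 and passes the unique fourth-degree polynomial through them by Lagrange interpolation, a technique chosen here for brevity and not one used in the text.

```python
# Five of the wheat samples from Table 6.1
# (X = per cent vitreous kernels, Y = per cent protein).
points = [(6, 10.3), (34, 10.9), (55, 11.1), (75, 12.2), (99, 19.0)]


def lagrange(pts, x):
    """Evaluate the polynomial of degree len(pts) - 1 passing through pts."""
    total = 0.0
    for i, (xi, yi) in enumerate(pts):
        term = yi
        for j, (xj, _) in enumerate(pts):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


# With five constants and five observations every observation is reproduced.
reproduced = [lagrange(points, x) for x, _ in points]

# Between the observations the curve is free to wander; sample 3 of
# Table 6.1 (X = 87) actually showed 14.5 per cent protein.
between = lagrange(points, 87)
print(reproduced, round(between, 2))
```

A curve of this kind leaves no residuals from which to judge how far it might miss a new sample, which is the difficulty the paragraph above describes.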

Fitting Lines or Parabolas to Time Series. In studying time series, it
is sometimes desirable to fit a straight line or a curve to the successive
observations as a means of determining the long-time trend. The
techniques of time-series analysis for individual variables lie outside the scope
of this book, and therefore are not given special consideration here. Some
special problems involved in correlating two or more time series are
considered in Chapter 20.⁹ Fitting a mathematical trend to a time series
involves regarding the successive months or years as values of the X,
or independent, variable. The fact that these values are regularly spaced
1, 2, 3, 4, etc., and that the same succession occurs in many problems,
makes possible special methods and special tables, which greatly reduce the
labor of fitting the equations. This method of computation, known as
orthogonal polynomials, should be used in determining lines or parabolic
curves for such data.¹⁰
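The economy of orthogonal polynomials for regularly spaced observations can be sketched briefly. In the block below the series is made up for illustration, and the function name `orthogonal_trend` is an assumption of this sketch. For observations at t = 1, 2, ..., n, the linear term is the deviation of t from its mean, and the quadratic term is that deviation squared less $(n^2 - 1)/12$; because the terms are mutually orthogonal, each constant is found by a single division, with no simultaneous equations to solve.

```python
def orthogonal_trend(series):
    """Fit a parabolic trend to equally spaced observations t = 1, 2, ..., n."""
    n = len(series)
    t_mean = (n + 1) / 2
    p1 = [t - t_mean for t in range(1, n + 1)]        # linear term
    p2 = [p * p - (n * n - 1) / 12 for p in p1]       # quadratic term
    mean = sum(series) / n
    # Each coefficient comes from one ratio, since p1 and p2 are orthogonal
    # to each other and to the constant term.
    c1 = sum(y * p for y, p in zip(series, p1)) / sum(p * p for p in p1)
    c2 = sum(y * p for y, p in zip(series, p2)) / sum(p * p for p in p2)
    trend = [mean + c1 * u + c2 * v for u, v in zip(p1, p2)]
    return mean, c1, c2, trend


hypothetical_series = [1, 4, 9, 16, 25]   # made-up values, exactly t squared
mean, c1, c2, trend = orthogonal_trend(hypothetical_series)
print(mean, c1, c2, trend)
```

Since the made-up series is an exact parabola, the fitted trend reproduces it; with real data the trend values would differ from the observations by the residuals.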

Fitting a Logarithmic Curve. Some of the other types of curves
mentioned on page 70, such as types (b), (c), and (d), using logarithms,
and type (e), using reciprocals, may be fitted with relatively
little computation. The methods of fitting several of these types may
be shown for the present case, even though they may fail to give any better
fit than the curves which have already been computed.

⁹ An excellent discussion of the methods and meaning of time-series analysis for
individual series is given by Frederick C. Mills in his textbook, Statistical Methods,
3rd ed., Chapters 10, 11, and 12, Henry Holt, New York, 1955.

¹⁰ For methods of fitting orthogonal polynomials, see Frederick E. Croxton and
Dudley J. Cowden, Applied General Statistics, 2d ed., pp. 289-90, Prentice-Hall,
Englewood Cliffs, N.J., 1955, and R. A. Fisher, Statistical Methods for Research
Workers, 12th ed., pp. 147-56, Oliver and Boyd, Edinburgh and London, 1954.

The three simple types of logarithmic curves, (b), (c), and (d), may all
be fitted by exactly the same method previously used in fitting a straight
line, except that the logarithms of X, of Y, or of both together are employed
where otherwise the values of the variables themselves are used.
Comparison of the straight-line formula with the logarithmic formula indicates
how this is done.

If we use $\bar{Y}$ to represent the logarithms of the Y values, and $\bar{X}$ to
represent the logarithms of the X values, our equations will change as
follows:

$$\begin{aligned}
\text{(b)}\quad \log Y = a + bX, &\quad\text{to}\quad \bar{Y} = a + bX \\
\text{(c)}\quad \log Y = a + b\log X, &\quad\text{to}\quad \bar{Y} = a + b\bar{X} \\
\text{(d)}\quad Y = a + b\log X, &\quad\text{to}\quad Y = a + b\bar{X}
\end{aligned}$$

In each case it is evident that the new equation is identical in form with
the simple straight-line equation,

$$Y = a + bX$$

and the same methods may therefore be used in determining the constants
a and b as were used earlier in equations (5.2) to (5.5).

Some indication as to which one of the three logarithmic formulas will
come nearest to fitting a given set of data can be obtained by converting
both the X and Y values to logarithms, variables $\bar{X}$ and $\bar{Y}$, and then
making dot charts of $\bar{Y}$ against X, of $\bar{Y}$ against $\bar{X}$, and of Y against $\bar{X}$.
If one chart shows the dots falling in substantially a straight line, the
equation corresponding to that chart will give the most satisfactory fit.¹¹

The first step in applying any one of the three logarithmic equations
to the data of the wheat example is to work out the logarithms and
construct the three dot charts, to indicate which formula to use. The form
of computation is shown in Table 6.4.
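Where the dot charts cannot be drawn conveniently, a numerical stand-in for the visual comparison is to compute how nearly linear each of the three pairings is. The sketch below uses the ordinary product-moment correlation coefficient for this purpose; that substitution is an assumption of this illustration and not the procedure of the text, which relies on inspection of the charts. The names `correlation` and `linearity` are likewise chosen here.

```python
import math

X = [6, 75, 87, 55, 34, 98, 91, 45, 51, 17, 36, 97, 74, 24, 85, 96, 92, 94, 84, 99]
Y = [10.3, 12.2, 14.5, 11.1, 10.9, 18.1, 14.0, 10.8, 11.4, 11.0,
     10.2, 17.0, 13.8, 10.1, 14.4, 15.8, 15.6, 15.0, 13.3, 19.0]


def correlation(ps, qs):
    """Product-moment correlation between two equally long lists."""
    n = len(ps)
    mp, mq = sum(ps) / n, sum(qs) / n
    spq = sum((p - mp) * (q - mq) for p, q in zip(ps, qs))
    spp = sum((p - mp) ** 2 for p in ps)
    sqq = sum((q - mq) ** 2 for q in qs)
    return spq / math.sqrt(spp * sqq)


log_x = [math.log10(x) for x in X]     # logarithms to base 10, as in Table 6.4
log_y = [math.log10(y) for y in Y]

# One entry per dot chart: the closer to 1, the nearer the dots lie to a line.
linearity = {
    "log Y against X": correlation(X, log_y),
    "log Y against log X": correlation(log_x, log_y),
    "Y against log X": correlation(log_x, Y),
}
for name, r in linearity.items():
    print(name, round(r, 3))
```

A single coefficient cannot show where a chart bends, so it supplements the dot charts and does not replace them.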

It should be noted that in working out the logarithms nothing can be
added or subtracted from any of the variables (except for rounding off
decimals).¹³ In all the previous work the protein had been stated as protein
in excess of 10 per cent, but now the original percentage figures are used
once more. That is because logarithms deal with relative values, and the
ratio of 1 to 2 is quite different from that of 11 to 12. All the previous
equations have dealt with absolute values or differences from the average,
and the absolute difference between 1 and 2 is of course just the same as
that between 11 and 12.

Table 6.4

Variables in Wheat Problem and Logarithms of Values

                 Vitreous         Logarithms of Variables*
 Protein,        Kernels,       Protein,      Vitreous Kernels,
    Y               X           $\bar{Y}$         $\bar{X}$
 Per cent        Per cent
   10.3              6           1.013            0.778
   12.2             75           1.086            1.875
   14.5             87           1.161            1.940
   11.1             55           1.045            1.740
   10.9             34           1.037            1.531
   18.1             98           1.258            1.991
   14.0             91           1.146            1.959
   10.8             45           1.033            1.653
   11.4             51           1.057            1.708
   11.0             17           1.041            1.230
   10.2             36           1.009            1.556
   17.0             97           1.230            1.987
   13.8             74           1.140            1.869
   10.1             24           1.004            1.380
   14.4             85           1.158            1.929
   15.8             96           1.199            1.982
   15.6             92           1.193            1.964
   15.0             94           1.176            1.973
   13.3             84           1.124            1.924
   19.0             99           1.279            1.996

* Logarithms to base 10.

¹¹ This is strictly true only if the “goodness of fit” is measured in terms of the
logarithms used.

¹² Logarithms may also be used with parabolas of higher orders, such as:

$$\log Y = a + bX + cX^2$$

Such involved curves will not be considered at length in this book, however.

¹³ After the logarithms are once computed, however, they can be “coded” by
subtracting a constant or by division, just as other variables have been treated formerly,
with the same effect on the final constants obtained.



Figure 6.6 shows the three dot charts in which the three different ways
of combining the logarithmic and actual values are shown. None of the
three gives a very close linear relation, but the one where $\bar{Y}$ and X are
plotted seems to come nearest. The equation

$$\log Y = a + bX, \quad\text{or}\quad \bar{Y} = a + bX$$

will therefore be used.

The values necessary to determine a and b are as follows, using equations
(5.2) and (5.3):

$$\Sigma X^2,\quad \Sigma X\bar{Y},\quad M_x,\quad M_{\bar{y}}$$

Fig. 6.6. Dot charts illustrating log Y = f(X); Y = f(log X); log Y = f(log X).

Table 6.5 shows the form of computation of these values from the
original values of the two variables, with part of the observations omitted.
This computation gives the values necessary to compute a and b.

The averages of X and $\bar{Y}$ of course are:

$$M_x = \frac{\Sigma X}{n} = \frac{1{,}340}{20} = 67 \qquad M_{\bar{y}} = \frac{\Sigma \bar{Y}}{n} = \frac{22.389}{20} = 1.11945$$






Table 6.5

Computation, for Wheat Problem, of Values Needed to
Determine Constants for Logarithmic Curve

               Vitreous      Logarithm
 Protein,      Kernels,        of Y,            Extensions
    Y             X          $\bar{Y}$        X²         $X\bar{Y}$
 Per cent      Per cent
   10.3            6           1.013           36         6.078
   12.2           75           1.086        5,625        81.450
   14.5           87           1.161        7,569       101.007
    ...          ...             ...          ...           ...
   19.0           99           1.279        9,801       126.621

 Sums        ΣX = 1,340   Σ$\bar{Y}$ = 22.389   ΣX² = 107,566   Σ$X\bar{Y}$ = 1,545.888


Then, by equation (5.2),

$$b = \frac{\Sigma X\bar{Y} - nM_xM_{\bar{y}}}{\Sigma X^2 - nM_x^2} = \frac{1{,}545.888 - 20(67)(1.11945)}{107{,}566 - 20(67)^2} = 0.002576$$

and by equation (5.3),

$$a = M_{\bar{y}} - b(M_x) = 1.11945 - (0.002576)(67) = 0.9469$$

In terms of the variable, the equation required is therefore

$$\bar{Y} = a + bX = 0.9469 + 0.002576X$$

or

$$\log Y = a + bX = 0.9469 + 0.002576X$$

The percentage of protein can now be estimated from the proportion
of vitreous kernels observed for any sample of wheat, by substituting
the percentage of vitreous kernels (the X values) in this equation and
working it out. Thus for the first sample, with 6 per cent of vitreous
kernels, it would work out as follows:

$$\log Y = a + bX = 0.9469 + 0.002576(6) = 0.9624$$

Using a table of logarithms we find that the number corresponding to
the logarithm 0.9624 (that is to say, its antilogarithm) is 9.17. The
estimated proportion of protein is therefore 9.17 per cent.

Similarly, if the proportion of vitreous kernels in the second sample,
75, is substituted in the equation, the work to calculate the estimated
proportion of protein is:

$$\begin{aligned}
\log Y &= a + bX = 0.9469 + 0.002576(75) \\
\log Y &= 1.1401 \\
\text{antilog } 1.1401 &= 13.81
\end{aligned}$$

The estimated proportion of protein is therefore 13.81 per cent.
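The whole computation for the logarithmic curve, from Table 6.4 through the two estimates just worked, can be sketched as follows. The block is illustrative only; the function name `estimate_protein` is an assumption, and the logarithms are rounded to three places simply to mirror the tables. Because the text rounds b before computing a, the last digit of a may differ from the printed 0.9469.

```python
import math

X = [6, 75, 87, 55, 34, 98, 91, 45, 51, 17, 36, 97, 74, 24, 85, 96, 92, 94, 84, 99]
protein = [10.3, 12.2, 14.5, 11.1, 10.9, 18.1, 14.0, 10.8, 11.4, 11.0,
           10.2, 17.0, 13.8, 10.1, 14.4, 15.8, 15.6, 15.0, 13.3, 19.0]

# Three-place logarithms to base 10, as tabulated in Table 6.4.
log_y = [round(math.log10(y), 3) for y in protein]

n = len(X)
m_x = sum(X) / n
m_log_y = sum(log_y) / n

# Straight-line constants, with log Y in the place of Y.
b = (sum(x * ly for x, ly in zip(X, log_y)) - n * m_x * m_log_y) / (
    sum(x * x for x in X) - n * m_x * m_x)
a = m_log_y - b * m_x


def estimate_protein(x):
    """Antilogarithm of the fitted line log Y = a + bX."""
    return 10 ** (a + b * x)


print(round(b, 6), round(a, 4))
print(round(estimate_protein(6), 2), round(estimate_protein(75), 2))
```

Raising 10 to the fitted power does the work of looking up the antilogarithm in a table.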

Table 6.6 shows this computation carried through for each of the 20 
observations. 


Table 6.6

Computation, for Wheat Problem, of Estimated Protein
Content from Per Cent of Vitreous Kernels on the
Basis of a Logarithmic Curve
(Log Y = 0.9469 + 0.00258X)

                   Estimated Protein
 Vitreous     Estimated     Antilog of      Actual       Percentage Errors
 Kernels,     logarithm,    Estimate,      Protein,      in Estimating
    X         $\bar{Y}'$        Y'             Y         Protein Proportion
 Per cent                   Per cent       Per cent
     6         0.9624          9.2           10.3            +12.0
    75         1.1401         13.8           12.2            −11.6
    87         1.1710         14.8           14.5             −2.0
    55         1.0886         12.3           11.1             −9.8
    34         1.0345         10.8           10.9             +0.9
    98         1.1993         15.8           18.1            +14.6
    91         1.1813         15.2           14.0             −7.9
    45         1.0628         11.6           10.8             −6.9
    51         1.0783         12.0           11.4             −5.0
    17         0.9907          9.8           11.0            +12.2
    36         1.0396         11.0           10.2             −7.3
    97         1.1968         15.7           17.0             +8.3
    74         1.1375         13.7           13.8             +0.7
    24         1.0087         10.2           10.1             −1.0
    85         1.1659         14.7           14.4             −2.0
    96         1.1942         15.6           15.8             +1.3
    92         1.1839         15.3           15.6             +2.0
    94         1.1890         15.5           15.0             −3.2
    84         1.1633         14.6           13.3             −8.9
    99         1.2019         15.9           19.0            +19.5


It should be noted in this table that errors made in estimating the
proportion of protein are stated as relative errors rather than absolute
errors.¹⁴ That is done because the thing that is really estimated is the
logarithm of the percentages of protein, or $\bar{Y}$, and the errors are really
the differences between the actual logarithms and the estimated logarithms.
If z is used to stand for the error, in this case z is really in terms of
logarithms, that is,

$$z = \log Y - \text{estimated } \log Y, \quad\text{or}\quad z = \bar{Y} - \bar{Y}'$$

or in terms of natural numbers

$$\text{antilog } z = \frac{\text{antilog } \bar{Y}}{\text{antilog } \bar{Y}'} = \frac{\text{actual } Y}{\text{estimated } Y}$$

Subtracting the constant 1.00 and multiplying by 100 changes this
relative figure to the percentage the observed value is above or below
the estimate.

Where log Y is taken as the dependent variable, as has been done here,
fitting the equation by the methods just shown involves making the
sum of squares of the logarithmic residuals around the line as small as
possible. Instead of minimizing the sum of the absolute errors, squared,
as heretofore, we now minimize approximately the sum of the percentage
errors, squared.
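The conversion from the ratio of actual to estimated protein into a percentage error can be sketched in a few lines. The function name `percentage_error` is an assumption of this illustration; the constants are those of the fitted logarithmic equation, and the estimate is rounded to one decimal because Table 6.6 tabulates the antilogarithms that way.

```python
def percentage_error(actual, x, a=0.9469, b=0.002576):
    """Per cent by which the observed Y lies above (+) or below (-) Y'."""
    # Antilogarithm of the estimated logarithm, kept to one decimal place
    # as in the third column of Table 6.6.
    estimate = round(10 ** (a + b * x), 1)
    return (actual / estimate - 1.0) * 100


first = percentage_error(10.3, 6)     # first sample of Table 6.6
second = percentage_error(12.2, 75)   # second sample of Table 6.6
print(round(first, 1), round(second, 1))
```

Carrying the unrounded antilogarithm instead would shift these percentages slightly, since the first estimate is 9.17 before rounding to 9.2.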

In Figures 6.7 and 6.8 the actual proportions of protein, shown as dots,
are compared with the estimated values as worked out by the logarithmic
relation. In the first of these figures the actual and estimated values are
both stated in terms of the logarithms. It is quite apparent here that this
equation assumes a straight-line relation between the proportion of
vitreous kernels and the logarithms of the proportion of protein; since
they were computed by a straight-line equation (log Y = a + bX), the
estimated values all lie along the continuous straight line indicated. The
next figure, however, compares the actual proportion of protein with the
estimated, both stated in actual terms. Here the continuous curve which
the logarithms produce in the estimated actual values is clearly shown.
The relation between the proportion of vitreous kernels and the percentage
of protein, as shown by this curve, does not agree with the actual relation
as shown by the original observations even as closely as did the previous
curves computed by means of parabolic equations.

Before discussing other ways of expressing the curvilinear relation it
might be well to discuss the procedure to determine the constants a and b
if either of the other two forms of simple logarithmic equations were used.

If the equation Y = a + b log X is employed, the form $Y = a + b\bar{X}$
is used.

The values which must be computed are

$$M_{\bar{x}},\quad M_y,\quad \Sigma \bar{X}^2,\quad \Sigma Y\bar{X}$$

and the constants are determined from the equations

$$b = \frac{\Sigma Y\bar{X} - nM_{\bar{x}}M_y}{\Sigma \bar{X}^2 - nM_{\bar{x}}^2}$$

$$a = M_y - bM_{\bar{x}}$$

Since the equation is in terms of Y itself, the estimated values, computed
from the logarithms of X, will be directly in values of Y, and will not have
to be converted to the antilogarithms.

If the equation log Y = a + b log X is to be fitted, the form
$\bar{Y} = a + b\bar{X}$ is used.

The values which have to be computed are:

$$M_{\bar{x}},\quad M_{\bar{y}},\quad \Sigma \bar{X}^2,\quad \Sigma \bar{Y}\bar{X}$$

and the constants are determined from the equations

$$b = \frac{\Sigma \bar{Y}\bar{X} - nM_{\bar{y}}M_{\bar{x}}}{\Sigma \bar{X}^2 - nM_{\bar{x}}^2}$$

$$a = M_{\bar{y}} - bM_{\bar{x}}$$

In this case the equation is in terms of $\bar{Y}$, the logarithms of Y, and the
estimated values will therefore have to be converted from logarithms into
natural numbers to show just what the relationship is, just as was done
in the case that was worked out in detail earlier.

Fig. 6.7. Dot chart showing observations and fitted line for equation
log Y = a + bX, in logarithms of Y.

Fig. 6.8. Dot chart showing observations and fitted line for equation
log Y = a + bX, in natural values of Y.

¹⁴ The reason for making this distinction will be seen later on, when the question of
measuring the accuracy of the estimate is taken up.
No matter which one of the three logarithmic curves is employed, the
arithmetic is the same as in determining the simple straight line, with the
exception of computing the logarithms and of substituting the appropriate
logarithms where the natural values would otherwise be employed.

In cases where other modifications of the straight-line equation, such
as type (e), are to be used, the process is to transform the equation to a
linear form, then compute the constants just as before.
Thus the type

$$Y = \frac{1}{a + bX}$$

can be converted to the form

$$\frac{1}{Y} = a + bX$$

or, letting 1/Y = Q,

$$Q = a + bX$$

The computation can then be carried out in the usual way, and after
the estimated values of Q, Q', are worked out, they are converted back
into Y values by the equation Y = 1/Q.
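The reciprocal transformation can be sketched in the same manner. The data below are made up for illustration, generated from Y = 1/(0.5 + 0.1X) so that the recovered constants can be checked by eye; the function name `fit_line` is likewise an assumption of this sketch.

```python
def fit_line(xs, ys):
    """Constants a and b of the straight line Y = a + bX."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum(x * y for x, y in zip(xs, ys)) - n * mx * my) / (
        sum(x * x for x in xs) - n * mx * mx)
    return my - b * mx, b


# Hypothetical observations lying exactly on Y = 1 / (0.5 + 0.1 X).
xs = [1, 2, 3, 4, 5]
ys = [1 / (0.5 + 0.1 * x) for x in xs]

qs = [1 / y for y in ys]              # Q = 1/Y turns the curve into a line
a, b = fit_line(xs, qs)

# Estimated Q' values are converted back to Y' by Y = 1/Q.
estimates = [1 / (a + b * x) for x in xs]
print(round(a, 4), round(b, 4))
```

With real observations the line would be fitted to scattered values of Q, and the squared residuals minimized would be those of the reciprocals, not of Y itself.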

Fitting a Conditioned Parabola. Sometimes the logic of a problem
requires that one of the constants in an equation be 0. In mathematical
language, we impose a “condition” upon the curve of relationship over and
above the usual conditions implied in the least-squares method. For
example, as discussed earlier, the auto-stopping relation (Table 3.1)
should logically be expressed by a curve of the equation

$$Y = bX + cX^2 \qquad (6.9)$$

Let us fit that curve, and see how well it represents the data.

To determine the constants b and c, we will let $U = X^2/100$, and
$c' = 100c$, and solve the two simultaneous equations:

$$\begin{aligned}
(\Sigma X^2)b + (\Sigma XU)c' &= \Sigma XY \\
(\Sigma XU)b + (\Sigma U^2)c' &= \Sigma UY
\end{aligned} \qquad (6.10)$$

The form of computation of the necessary sums is shown in Table 6.7,
but the detailed extensions are entered for only 5 of the observations.


Table 6.7

Computations for Fitting Equation Y = bX + cX² to
Auto-Stopping Data

   X      Y       X²      U = X²/100       XU        XY          U²          UY
   5      2       25         0.25         1.25       10        0.0625        0.50
  10      8      100         1.00        10.00       80        1.0000        8.00
  35    107     1225        12.25       428.75     3745      150.0625     1310.75
  22     35      484         4.84       106.48      770       23.4256      169.40
  40    134     1600        16.00       640.00     5360      256.0000     2144.00

 Sums          28719                   7964.51    66358     2411.8383    19536.51


(These computations can be made more easily from ihe grouped data 
of Table 4.1.) 

Substituting the necessary sums in equations (6.10), we have

$$
\begin{aligned}
28{,}719b + 7{,}964.51c' &= 66{,}358 \\
7{,}964.51b + 2{,}411.84c' &= 19{,}536.51
\end{aligned}
$$

Solving the equations simultaneously (one convenient procedure for
this is shown on page 173), we obtain the values b = 0.76192
and c' = 5.58420. c is therefore 0.055842, and our equation for the curve
is

$$ Y = 0.7619X + 0.05584X^2 $$
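As a check on the solution, the two equations can be solved directly by determinants. The sketch below uses the sums from Table 6.7; the helper name solve_2x2 is an assumption of this illustration and is not the procedure shown on page 173. The constants it yields should agree closely with those quoted above, though the later decimals may differ slightly because hand computation involves rounding at intermediate steps.

```python
def solve_2x2(a11, a12, a21, a22, r1, r2):
    """Solve a11*x + a12*y = r1 and a21*x + a22*y = r2 by determinants."""
    det = a11 * a22 - a12 * a21
    x = (r1 * a22 - a12 * r2) / det
    y = (a11 * r2 - a21 * r1) / det
    return x, y


# Sums taken from the last line of Table 6.7.
sum_x2, sum_xu, sum_u2 = 28719.0, 7964.51, 2411.8383
sum_xy, sum_uy = 66358.0, 19536.51

b, c_prime = solve_2x2(sum_x2, sum_xu, sum_xu, sum_u2, sum_xy, sum_uy)
c = c_prime / 100  # undo the scaling U = X^2/100

print(round(b, 4), round(c_prime, 4), round(c, 5))
```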


Calculating the values of Y' for selected values of X, we obtain

   X       Y'         X       Y'
   5      5.2        25     53.9
  10     13.2        30     73.1
  15     24.0        35     95.1
  20     37.6        40    119.8


Fig. 6.9. Logical curve fitted to speed of auto and distance it takes to stop,
compared to line of group averages and their confidence intervals.

In Figure 6.9 this new curve has been plotted in comparison with the
original group averages and their confidence intervals for P = 0.67, as
shown earlier on Figure 4.3. Throughout almost its entire length, the
new curve lies within these confidence intervals of the group averages, and
also frequently intersects the irregular line of the original group averages.
The curve, logically derived, therefore appears to fit the trend of the
original observations far better than any straight line could, and to confirm
the logical analysis on which the type of the equation was based. It can
therefore be used with reasonable safety for extrapolations beyond the
speeds observed in the sample. For example, it would indicate that at
60 miles an hour the distance required for stopping would average 247
feet, and at 90 miles, 521 feet. Such extrapolations would become highly
unreliable, however, if speeds were considered that were so high that
additional factors not present at lower speeds, such as wind resistance,
became important.
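The estimates quoted in the text, including the two extrapolations, follow directly from the fitted equation. A brief Python sketch, using the rounded constants 0.7619 and 0.05584 given above (the function name is illustrative):

```python
B = 0.7619   # coefficient of X in the fitted equation
C = 0.05584  # coefficient of X^2 in the fitted equation


def stopping_distance(speed_mph):
    """Estimated stopping distance in feet from Y = bX + cX^2."""
    return B * speed_mph + C * speed_mph ** 2


# Speeds within the observed range, followed by the two extrapolations.
for speed in (5, 10, 15, 20, 25, 30, 35, 40, 60, 90):
    print(speed, round(stopping_distance(speed), 1))
```

Nothing in the arithmetic distinguishes the last two speeds from the others; the warning about extrapolation rests on the logic of the problem, not on the computation.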

The values we have obtained for our constants b and c, and the estimated
distances based on them, are of course derived from the relations shown
in this single sample of 63 observations. They are still only statistics,
and may vary perceptibly from the true parameters in the entire universe.
Methods of computing confidence intervals for these values will be
presented later (pages 283-287).




Limitations of Equations in Describing Relationships 

Up to this point an expression of the relation between the proportion
of vitreous kernels and the proportion of protein in each sample has been
worked out on the basis of a number of different mathematical formulas.
Each different equation has given a different curve. Some, such as the
cubic parabola or the logarithmic curve, have given curves coming
somewhere near to the relationship shown by the actual observations
themselves; others, such as the simple straight line, have entirely failed
to describe the relation. Yet the exact slope or shape of each curve was
determined from the same set of observations; the constants of each
curve were determined by “fitting” the same data. The diversity in the
shape of the different curves is strikingly shown in Figure 6.10, where
the several different curves are all drawn on one scale, and the original
observations are shown as well. It is quite apparent that the differences
in the shapes of the several curves are due solely to the particular form of
equation used in computing them. There are certain types of relations
which can be accurately represented by each of these equations.
If an equation is “fitted” to data where that type of relation is really
present, it can give a curve which accurately represents the central
tendency of the data. But when the same equation is fitted to data for
which the underlying relationship follows a different function, the
resulting curve gives only a distorted representation of the true relation;
it shows the relation only insofar as it is possible to do so within the
limits of the particular equation used.

Fig. 6.10. Original observations, and several different types of fitted curves.

So far there has been no attempt to show what there is in the “nature”
of relations which may make them of the type to be represented accurately
by one type of equation or by another. Instead, the purely empirical
test of the way each one fits has been relied upon. If, as judged by the
eye, the relation shown by the fitted curve looked like the relation shown
by the original observations, we have said it gave a satisfactory fit; if
it has not looked like it, we have said it did not give a satisfactory fit.
And in this particular case, none of the computed curves has been really
fully satisfactory: we can readily see that there might be some other
smooth continuous curve which would come much closer to the actual
observations than does any of the curves so far computed.

Of course we might continue the process, using more and more complex
equations or other ways of stating the variables,* until finally we found
one which did satisfactorily describe the relation. Or it might be that the
underlying curve was so complex that it could not be represented in
elementary algebraic terms. So long as the equation had been derived
merely by the “cut and try” method described, it would have no logical
meaning beyond serving as a simple device for estimating values of the
one variable from known values of the other and would throw no particular
light upon the real or inherent nature of the relation. Sometimes it is
found that two different types of equations may give almost identical
estimates within the range of the observations fitted.† Which one expresses
the “true” nature of the relation? Merely because a given equation can
reproduce a certain relation is no proof that it really “expresses” the
nature of the relation. To establish this, we need a logical explanation
which leads to the given equation, which in turn does closely fit the central
tendency of the observed data.

If, however, it is not desired to determine what the “real nature” of
the relationship is, but it is merely desired to express it sufficiently well
so that values of one variable (such as protein content) can be estimated
from known values of another (such as the proportion of vitreous kernels),
it does not make any difference what type of equation is used, so long as
it represents the observed relationship adequately. As a matter of fact,
it is not necessary to have an equation at all. All that is really necessary is
a graph of the curve, or a table of values for one variable corresponding
to values of another, from which we can construct a graph. Further,
by enlarging the chart and making the scale sufficiently detailed, we may
read off the estimated values to any degree of accuracy that is desired;
much more accurately, as a matter of fact, than our ability to determine
the real relation usually justifies, as will be evident later on.

* A good fit can be obtained in this problem by stating the relation a different way,
i.e., by relating the per cent of protein to the logarithm of the non-vitreous kernels. The
equation used is then

$$ Y = a - b \log(1 - X) $$

See Frederick F. Stephan, Alternative statements of percentage data in the fitting of
logarithmic curves, Journal of the American Statistical Association, Vol. XXVI, pp. 58-61,
1931.

† An example of this may be seen in the bulletin by Hugh Killough, What makes the
price of oats, U.S. Department of Agriculture Bulletin 1351, p. 8, 1925.

In many cases simply the working expression of the relation may be
all that is either needed or desirable. The “true relation” between the
variables may be so involved that a very complex mathematical expression
would be required to represent it properly. Even simple types of physical
relations may require rather complex equations to represent them. In
many cases, too, the knowledge of the causes of the relation may be so
undeveloped that there is no real basis for expressing the relationship
mathematically. The relation between vitreous kernels and percentage
of protein would be an example of this type: very complex details of
chemical content and physical and biological structure are probably
responsible, so complex as to be quite beyond satisfactory reduction
to mathematical expression. Yet the original observations undeniably
indicate that there is some sort of definite relation. For many practical
purposes it may be entirely satisfactory merely to know what the
relationship is, without bothering at all with what it really means. Even in
scientific study that may frequently be satisfactory as a first step, since in
many cases it is essential to know what are the facts before trying to work
out the reasons why they are as they are.

When the expression of the relation is not to be used except as an
empirical basis for estimating values of the dependent variable from the
independent, a curve can be determined with only a small fraction of the
effort required in “fitting” a mathematical equation, yet which fits the
data quite as well as any mathematical curve. In such cases the curve
may afford quite as satisfactory a description of the relation and a basis
for estimating one variable from the other as if elaborate computations
had been made. This method is known as freehand smoothing.

Expressing a Curvilinear Relation by a Freehand Curve. The
process of determining a freehand curve may be simply illustrated. The
simplest way to do it would be to plot the original observations on
coordinate paper, and then draw a continuous smooth curve through them by
eye in such a way as to pass approximately through the center of the
observations all along its course. Where the nature of the relation is
indicated as closely by the original observations as it is in the wheat
problem which we have been discussing, this might yield quite a
satisfactory expression of the relation. In other cases, however, the
observations might be more widely scattered, and the underlying relation
might be more difficult to determine, so that different persons, drawing in
the curves freehand, might draw in rather different curves. Some method is
therefore needed to give a greater degree of precision to the result, and
to insure that the same data would yield substantially the same result
even in the hands of different investigators.

This stability of result can be secured by a relatively minor extension
of the methods already discussed in the first illustration of a two-variable
relationship, the automobile stopping problem. There it was found that
by classifying the observations in appropriate groups, the general nature
of the relation could be expressed by an irregular line connecting the
several group averages. All that is needed is some method of deriving a
continuous smooth curve from that irregular line. The method for
smoothing out that irregular line, freehand, is very evident and simple.
At the same time, starting with the irregular line of group averages gives
a certain stability to the process and insures that different persons would
draw in the curve with about the same position and shape.

Applying the process to the wheat problem, the first step is to classify
the data into appropriate groups according to the values of the independent
variable, the proportion of vitreous kernels, and to determine the average
percentage of vitreous kernels and of protein content for the observations
falling into each group. The discussion of the automobile problem has
shown that for the differences in averages to be significant, it is necessary
for the groups to be large enough so that the averages would not vary
erratically from group to group. In some cases a little experimenting
might be necessary to determine what this size would be. In the present
case, inspection of the dot chart showing the original observations (Figure
6.3) indicates that a class interval of 25 per cent of vitreous kernels will
give groups large enough to make the averages of protein content fairly
stable from group to group.

The form of computation most convenient for obtaining the group
averages, using groups of the size suggested, is shown in Table 6.8.

Table 6.8

Computation of Averages to Use in Fitting Freehand Curve
for Wheat-Protein Problem

(Column groups: vitreous kernels below 25 per cent, 25-49 per cent,
50-74 per cent, and 75-100 per cent; under each group, vitreous kernels
(per cent) and protein (per cent).)

The averages for the several groups are shown in Figure 6.11, indicated
by circles, and original observations are again shown by solid dots. A
smooth continuous dashed curve has been drawn through the series of
group averages, ignoring the individual observations and following only
the general trend shown by the averages. This smooth curve comes quite
near to representing the relation shown by the individual observations
through most of its extent; but beyond 95 per cent of vitreous kernels
it fails to follow the individual observations. Through that portion of the
range the protein content rises much faster than is indicated by the average
for the whole range from 75 through 100 per cent vitreous kernels.

Fig. 6.11. Original observations and averages of protein content, and freehand curve.
(Horizontal axis: vitreous kernels (per cent), X.)

Because over half of all the observations fall in this upper portion of
the range, it would seem reasonable to classify them into smaller groups
so as to give a better basis for determining this portion of the curve. Let
us try splitting the observations above 50 into four groups, each with
about the same number of observations, say 50 to 69, 70 to 84, 85 to 94,
and 95 to 100. The computation of the new averages is shown in Table 6.9.
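The grouping and averaging described above can be sketched in Python. The 20 pairs of values are transcribed from the X and Y columns of Table 6.10, on the assumption that they are the same 20 wheat observations; the function name and the tuple layout of its result are illustrative only.

```python
# Proportion of vitreous kernels (X) and per cent of protein (Y),
# as listed for the 20 observations in Table 6.10.
X = [6, 75, 87, 55, 34, 98, 91, 45, 51, 17,
     36, 97, 74, 24, 85, 96, 92, 94, 84, 99]
Y = [10.3, 12.2, 14.5, 11.1, 10.9, 18.1, 14.0, 10.8, 11.4, 11.0,
     10.2, 17.0, 13.8, 10.1, 14.4, 15.8, 15.6, 15.0, 13.3, 19.0]


def group_averages(xs, ys, bounds):
    """Average X and Y within each class; bounds are inclusive (low, high) limits on X.

    Returns one tuple (low, high, count, mean X, mean Y) per non-empty class.
    """
    result = []
    for low, high in bounds:
        members = [(x, y) for x, y in zip(xs, ys) if low <= x <= high]
        n = len(members)
        if n == 0:
            continue
        mean_x = sum(x for x, _ in members) / n
        mean_y = sum(y for _, y in members) / n
        result.append((low, high, n, mean_x, mean_y))
    return result


# First trial: class interval of 25 per cent.
coarse = group_averages(X, Y, [(0, 24), (25, 49), (50, 74), (75, 100)])

# Second trial: the observations above 50 split into four smaller groups.
fine = group_averages(
    X, Y, [(0, 24), (25, 49), (50, 69), (70, 84), (85, 94), (95, 100)]
)

for low, high, n, mean_x, mean_y in fine:
    print(f"{low:3d}-{high:3d}  n={n:2d}  X={mean_x:5.1f}  Y={mean_y:5.2f}")
```

The count returned with each average is the figure one would note beside the plotted point when judging how much weight that average deserves.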


Table 6.9

Computation of Subaverages for Last Groups in Wheat Problem,
for Fitting Freehand Curve



These new averages, together with the previous ones for the lower
groups, are also plotted in Figure 6.11, and the number of cases that each
represents is indicated next to it, to aid in judging what weight to assign
to that average. Finally, a smooth continuous curve has been drawn in,
to pass as near as possible to the different averages without making illogical
twists or turns. As is evident in the figure, it has been possible to draw
the line with no point of inflection in it, yet so that it passes quite near to
all the group averages and approximately through the middle of the
individual observations. Further, the general course of the line is
sufficiently well defined by the several group averages so that if it were
redrawn, either by the same person or another person, it could have only
minor differences from the line actually shown. Making the chart over two or
three times, and drawing a separate curve on each trial, then averaging the
two or three curves together, is one method of reducing the variation due
to individual judgment in drawing the curve.

Cautions in Freehand Fitting. No attempt has been made to have
the curve follow all the twists and turns of the irregular line of averages.
As was shown previously with the automobile illustration, irregular
differences from group to group may be due to chance fluctuations in
sampling where the groups are small. Not unless the groups included
a very much larger number of cases than these do here would one be
justified in bending the curve because of the position of a single group
average, and not even then unless there was some logical basis for a curve
of that shape. In doubtful cases breaking up a particular group into
smaller groups, as was just done in the wheat example, or reclassifying
the observations into somewhat different groups, will help to determine
whether or not the data positively indicate that an extra inflection is
needed. It is also necessary to see if some single observation is responsible
for the abnormality; if it is, it is better to disregard it and draw the curve
without the extra twist.

In drawing in a freehand curve, it is desirable to place certain logical
limitations on the shape of the curve rather than to have it be purely
an empirical representation of the data. To do this, it is necessary to
decide before the curve is drawn what those limitations should be. The
limitations should be based upon a logical analysis of the relation under
examination, in the light of all the information available to the investigator.
In this case, for example, a consideration of the biological structure of
the kernels, of the portions which run high in protein content, and of the
appearance and size of those portions might lead one to the following
conclusions:

1. An increase in the proportion of vitreous kernels might be associated 
with no change in the proportion of protein, or with an increase in the 
proportion, but never with a decrease in the proportion. 

2. The relation between vitreous kernels and protein should be a pro- 
gressive one, consistently changing throughout the range of variation, 
rather than fluctuating back and forth. 

3. The maximum proportion of protein would be found with the largest 
proportion of vitreous kernels. 

These three logical expectations might then be expressed in the following
limitations to be placed on the shape of the curve to be drawn:

1. The curve should have no negative slope throughout its length. 

2. The curve should have no points of inflection, but should change 
shape continuously and progressively. 

3. The maximum should be reached at the end of the curve. 
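The three limitations listed above can also be checked mechanically once a curve has been reduced to a set of readings. The Python sketch below does so for the readings of the final freehand curve as they are tabulated later in Table 6.11. Treating the curve as straight between successive readings, and the small tolerance used in the comparisons, are assumptions of this illustration.

```python
# Readings from the freehand curve (vitreous kernels, protein), per Table 6.11.
curve = [(10, 10.4), (20, 10.5), (30, 10.7), (40, 10.9), (50, 11.2), (60, 11.7),
         (70, 12.4), (80, 13.5), (90, 15.0), (95, 16.2), (99, 18.0)]


def slopes(points):
    """Slope of each straight segment joining successive readings."""
    return [(y2 - y1) / (x2 - x1)
            for (x1, y1), (x2, y2) in zip(points, points[1:])]


def check_limitations(points, tol=1e-9):
    """Return three booleans, one for each limitation on the shape of the curve."""
    s = slopes(points)
    # 1. No negative slope anywhere along the curve.
    no_negative_slope = all(v >= -tol for v in s)
    # 2. No point of inflection: the slope changes in one direction only.
    changes = [later - earlier for earlier, later in zip(s, s[1:])]
    no_inflection = (all(c >= -tol for c in changes)
                     or all(c <= tol for c in changes))
    # 3. The maximum is reached at the end of the curve.
    max_at_end = points[-1][1] >= max(y for _, y in points) - tol
    return no_negative_slope, no_inflection, max_at_end


print(check_limitations(curve))
```

A curve with a hump, such as the hypothetical readings (0, 1), (1, 3), (2, 2), fails the first and third checks.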

These three logical limitations are all fulfilled by both the curves shown
in Figure 6.11, yet they would exclude other types of curves which might
be drawn. For example, they would rule out a curve with a hump or
twist in it, or one which sloped down and then up.*

In some cases, examination of the data by the method of successive
group averages, even after all the tests suggested above, will show the
presence of a relation which cannot be expressed within the logical
limitations imposed on the shape of the curve. In that case, the reasoning
underlying the logical analysis should be reexamined, to see if some step
requires restatement and if the limitations themselves should be changed.
(For a further discussion of this interaction of induction and deduction,
see pages 469-476 of Chapter 26.) For a curve to have real meaning,
it must be consistent with a careful logical analysis, no matter whether
the curve is obtained mathematically or freehand, or whether the logical
limitations are expressed in a mathematical equation or in a set of
limitations placed on the shape of the curve drawn by freehand fitting.

* This use of logical analysis in stating the limitations on a freehand curve may be
compared with the use of logic in deciding on the type of mathematical equation to
employ.

Interpreting the Fitted Curve. The use of the freehand curve in
estimating values of the dependent variable, percentage of protein, from
known values of the independent variable, proportion of vitreous kernels,
may be readily illustrated. Taking the first observation, with 6 per cent
of vitreous kernels, and reading off the corresponding proportion of
protein from the curve in Figure 6.11, we get 10.4 per cent as the estimated
protein content. Similarly for the second observation, 75 per cent vitreous
kernels, the curve indicates 12.9 per cent as the proportion of protein.
Reading off the estimated protein for each of the 20 observations, we get
the estimates shown in Table 6.10.

Even though in using the freehand curve we do not have an equation
stating the relation between X and Y, we still have a mathematical
expression of the relation between them. For we can write

$$ Y' = f(X) $$

which simply means that the estimates, or Y' values, are a function of X;
that is, for every X value there is some corresponding Y' value. Of course,
we can find what this corresponding value is only by reading it off the
curve, yet that is enough. We have a graphic statement of the functional
relation; if we had a definite formula to represent the curve, we would
have an analytical statement of the relation as well.

Although we have not fitted a definite equation to represent the freehand
curve, it is still possible to state the relation shown by the curve other
than in graphic form. This can be done by constructing a table showing,
for whatever values of the independent variable may be selected, the
corresponding estimated values of the dependent variable. Table 6.11
illustrates this method of stating the relation.

For a more detailed discussion of the pros and cons of freehand versus mathematical
fitting, see W. Malenbaum and J. D. Black, The use of the short-cut graphic method of
multiple correlation, Quarterly Journal of Economics, Vol. LII, November, 1937; and
The use of the short-cut graphic method of multiple correlation: Comment by Louis
Bean, Further comment by Mordecai Ezekiel, and Rejoinder and concluding remarks
by Malenbaum and Black, Quarterly Journal of Economics, February, 1940.





Table 6.10

Actual Per Cent of Protein and Proportion Estimated on Basis of
Freehand Curve

Proportion of       Actual Proportion   Proportion of Protein     Difference Between
Vitreous Kernels,   of Protein,         Estimated from Vitreous   Actual and Estimate,
X                   Y                   Kernels, Y' = f(X)        Y - Y'

       6                 10.3                  10.4                     -0.1
      75                 12.2                  12.9                     -0.7
      87                 14.5                  14.5                      0
      55                 11.1                  11.4                     -0.3
      34                 10.9                  10.7                      0.2
      98                 18.1                  17.4                      0.7
      91                 14.0                  15.2                     -1.2
      45                 10.8                  11.1                     -0.3
      51                 11.4                  10.3                      1.1
      17                 11.0                  10.5                      0.5
      36                 10.2                  10.8                     -0.6
      97                 17.0                  17.0                      0
      74                 13.8                  12.8                      1.0
      24                 10.1                  10.6                     -0.5
      85                 14.4                  14.2                      0.2
      96                 15.8                  16.7                     -0.9
      92                 15.6                  15.5                      0.1
      94                 15.0                  15.9                     -0.9
      84                 13.3                  14.0                     -0.7
      99                 19.0                  18.0                      1.0


In the range where the curve is rising most steeply the readings are
taken more closely together, to provide for reproducing that portion
of the curve more accurately. In addition, no readings are taken beyond
the range covered by the original observations.
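A table of readings such as Table 6.11 can serve as the function Y' = f(X) in computation as well as on paper. The Python sketch below estimates protein content for intermediate values by straight-line interpolation between adjacent readings. The interpolation rule is an assumption of this illustration, since the book reads values from the drawn curve, and the function name is hypothetical. In keeping with the caution above, values outside the tabulated range are refused.

```python
from bisect import bisect_right

# Readings from Table 6.11: (vitreous kernels, per cent; protein, per cent).
TABLE = [(10, 10.4), (20, 10.5), (30, 10.7), (40, 10.9), (50, 11.2), (60, 11.7),
         (70, 12.4), (80, 13.5), (90, 15.0), (95, 16.2), (99, 18.0)]


def estimate_protein(x):
    """Estimate Y' = f(X) by linear interpolation between tabulated readings."""
    xs = [point[0] for point in TABLE]
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("X lies outside the range of the tabulated readings")
    if x == xs[-1]:
        return TABLE[-1][1]
    i = bisect_right(xs, x) - 1          # index of the reading at or below x
    (x0, y0), (x1, y1) = TABLE[i], TABLE[i + 1]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


for x in (45, 55, 75, 97):
    print(x, round(estimate_protein(x), 2))
```

Because the readings are spaced more closely where the curve is steepest, straight-line interpolation loses less there than it would with evenly spaced readings.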

Table 6.11

Per Cent of Protein Corresponding to Various Proportions of Vitreous
Kernels in Samples of Wheat, as Indicated by 20 Observations

Proportion of       Corresponding       Proportion of       Corresponding
Vitreous Kernels    Proportion of       Vitreous Kernels    Proportion of
                    Protein                                 Protein
(per cent)          (per cent)          (per cent)          (per cent)

     10                10.4                  70                12.4
     20                10.5                  80                13.5
     30                10.7                  90                15.0
     40                10.9                  95                16.2
     50                11.2                  99                18.0
     60                11.7

When to Fit a Mathematical Equation. Mathematical curves will
have a distinct advantage over freehand methods when there is some
good logical basis for expecting a certain type of relation to hold. When
there is a logical basis for using a given formula, the constants of the
equation serve as an explanation of the real nature of the relationship.
In all other cases the mathematical curve is no more reliable than the
freehand curve; the latter may therefore be employed to describe the
nature of the relation, and can be determined with much less expenditure
of effort. That does not mean that a mathematical curve, based on
adequate logical analysis, is of no additional value. If it can be shown that
such a curve does fit the data, that may verify an hypothesis and so
provide a “law” to state the nature of the relationship, which may be of
far more value than the mere empirical statement of what the relationship
is observed to be.

Where the logical expectations do not lead to a relation which can
be formally expressed in a simple equation, they may, as has already been
shown, still be sufficient to state a set of limiting conditions to be used
in fitting a freehand curve. However, if a mathematical curve is found
empirically which fits the data about as well as a freehand curve could,
confidence intervals can be calculated by straightforward methods (note
Chapter 17). Where good computing equipment is available, the
substitution of machine time and clerical labor for the time of the researcher
himself may tip the balance in favor of mathematical curves.

A Mathematical Equation Used in an Economic Problem. Economists
sometimes find it convenient to classify commodities according to their
elasticities of demand with respect to income; that is, the ratio between
the rate of change in the expenditure made by families for a commodity
group as their incomes increase and the corresponding rate of change in
their incomes. In general, expenditures for necessities increase less than
proportionately to income, and expenditures for luxury goods and services
increase more than proportionately to income. Although the elasticity of
expenditure with respect to income for a particular commodity group
may change from one income level to another, it is often desired to obtain
an average elasticity over some specified range of incomes. This is
equivalent to assuming that the elasticity is constant over the range in
question.

This economic relationship can be expressed in a mathematical equation
or model just as readily as the several physical hypotheses which have
been discussed, for it makes certain definite assumptions as to the way the
two variables (total income and expenditures on a given commodity)
are related.

If X is used for income per capita, and E for expenditure for food per
capita, the desired elasticity for food as a whole is given by the coefficient
b in the equation

$$ \log E = a + b \log X $$

The application of this equation may be illustrated by concrete data. 
Table 6.12 shows data from family consumption studies in 15 selected 
areas of 13 countries. These have widely varying income levels per person. 
For each country or area, the average total consumption expenditures 
per person, and the average expenditures on food, are shown. (Total 
consumption expenditure is used here rather than total income, as in 
a number of countries the surveys did not collect data on total income.)
All values are adjusted to dollars of 1948 purchasing power, and are shown 
to 3 significant digits. Each study represents a large number of cases, 
from 180 consumption units for the smallest (Finland) to 22,705 for the 
largest sample (Japan). 

We now “fit” the equation log E = a + b log X to the data by the
methods previously discussed for fitting the straight line. The computation
of the extensions needed is also shown in Table 6.12. From the totals
obtained there, with X and E now standing for the logarithms of the data,
we then compute the statistics for a and b as follows:

$$ M_E = \frac{\sum E}{n} = \frac{31.8060}{15} = 2.1204 $$

$$ M_X = \frac{\sum X}{n} = \frac{37.3915}{15} = 2.4928 $$

$$ b = \frac{\sum(ex)}{\sum(x^2)} = \frac{\sum(EX) - nM_EM_X}{\sum(X^2) - nM_X^2} = 0.7124 $$

where the numerator, $\sum(ex)$, is 1.85158, and

$$ a = M_E - bM_X = 2.1204 - (0.7124)(2.4928) = 0.3445 $$

and then

$$ E = a + bX = 0.3445 + 0.7124X $$

and also

$$ \log E = 0.3445 + 0.7124 \log X. $$
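The same fit can be reproduced in Python from the dollar figures given in Table 6.12. This is a sketch only: the variable names are illustrative, and because it works with unrounded logarithms and means, the constants it yields may differ from 0.3445 and 0.7124 in the third or fourth decimal place.

```python
import math

# Average expenditure per person, in dollars of 1948 purchasing power (Table 6.12).
total = [52.9, 76.2, 102, 143, 144, 291, 309, 350, 407, 439,
         539, 622, 919, 1295, 1296]
food = [33.7, 48.1, 58.4, 69.6, 90.3, 153, 151, 135, 173, 152,
        172, 215, 247, 409, 351]

# X and E stand for the common logarithms of the data, as in the table.
X = [math.log10(v) for v in total]
E = [math.log10(v) for v in food]

n = len(X)
m_x = sum(X) / n
m_e = sum(E) / n

# b = [sum(EX) - n*M_E*M_X] / [sum(X^2) - n*M_X^2];  a = M_E - b*M_X
b = (sum(x * e for x, e in zip(X, E)) - n * m_e * m_x) / (
    sum(x * x for x in X) - n * m_x ** 2
)
a = m_e - b * m_x


def estimated_food(total_expenditure):
    """Food expenditure in dollars implied by log E = a + b log X."""
    return 10 ** (a + b * math.log10(total_expenditure))


print(round(m_x, 4), round(m_e, 4), round(a, 4), round(b, 4))
```

The function estimated_food converts an estimate back from logarithms to natural numbers.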

The size of the constant b, 0.71, indicates that on the average a 1 per
cent increase in total consumption expenditure is accompanied by
approximately 0.7 per cent increase in expenditures for food; that is, in
economic terms, the “income elasticity” of food expenditures (with reference
to total consumption expenditure) is 0.7.

Table 6.12

Expenditures Per Capita on Consumption and on Food,
and Computation of Logarithmic Curve
(log E = a + b log X)

                              Average Expenditure    Logarithms of Data       Extensions
Country or Area               Total      Food        Total,     Food,
                              (dollars)  (dollars)   X          E           X^2        XE

India                           52.9      33.7       1.7235     1.5276     2.97045    2.63282
Ceylon                          76.2      48.1       1.8820     1.6822     3.54192    3.16590
Gold Coast                     102        58.4       2.0086     1.7664     4.03447    3.54799
Japan                          143        69.6       2.1553     1.8426     4.64532    3.97136
Portugal (Porto)               144        90.3       2.1584     1.9557     4.65869    4.22118
Portugal (Lisbon)              291       153         2.4639     2.1847     6.07080    5.38288
Austria                        309       151         2.4900     2.1790     6.20010    5.42571
Ireland                        350       135         2.5441     2.1303     6.47244    5.41970
Finland                        407       173         2.6096     2.2381     6.81001    5.84055
Panama                         439       152         2.6425     2.1818     6.98281    5.76541
Switzerland                    539       172         2.7316     2.2355     7.46164    6.10649
Sweden                         622       215         2.7938     2.3324     7.80532    6.51626
Canada (1948)                  919       247         2.9633     2.3927     8.78115    7.09029
United States                 1295       409         3.1123     2.6117     9.68641    8.12839
Canada (large cities, 1953)   1296       351         3.1126     2.5453     9.68828    7.92250

Sums                                                37.3915    31.8060    95.80981   81.13743

L. Goreux, Long range projections of food consumption, FAO Monthly Bulletin of
Agricultural Economics and Statistics, Vol. 6, No. 6, pp. 1-18, June, 1957.

Studies of the relation of food expenditure to family income within
individual countries sometimes suggest that the income elasticity declines
as income rises. To test this hypothesis with respect to the data in Table
6.12 we would have to fit a parabola of the form
$\log E = a + b \log X + c(\log X)^2$. We can form an opinion about this
by plotting both the original observations and the logarithmic straight line
just fitted. Figure 6.12 shows this comparison (on the left half of the
figure) in terms of the logarithmic values used in the computation and with
the logarithmic values of the function which is, of course, a straight line.
The straight line seems to fit the original values quite closely and it seems
doubtful that a logarithmic parabola would improve the fit. There is a slight
indication that income elasticity of food expenditures declines with
increasing income.
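The logarithmic parabola mentioned above is not computed in the text. For a reader who wishes to try it, the following Python sketch fits both the straight line and the parabola in the logarithms by solving the least-squares normal equations. The elimination routine and all function names are assumptions of this illustration, and no numerical result is claimed here beyond the general fact that the added term can never increase the sum of squared residuals.

```python
import math

# Dollar figures from Table 6.12.
total = [52.9, 76.2, 102, 143, 144, 291, 309, 350, 407, 439,
         539, 622, 919, 1295, 1296]
food = [33.7, 48.1, 58.4, 69.6, 90.3, 153, 151, 135, 173, 152,
        172, 215, 247, 409, 351]
X = [math.log10(v) for v in total]
E = [math.log10(v) for v in food]


def solve(matrix, rhs):
    """Gaussian elimination with partial pivoting for a small linear system."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    sol = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][k] * sol[k] for k in range(i + 1, n))
        sol[i] = (aug[i][n] - s) / aug[i][i]
    return sol


def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients [c0, c1, ...] of ys = c0 + c1*x + c2*x^2 + ..."""
    m = degree + 1
    matrix = [[sum(x ** (i + j) for x in xs) for j in range(m)] for i in range(m)]
    rhs = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(m)]
    return solve(matrix, rhs)


def rss(coeffs, xs, ys):
    """Sum of squared residuals of the fitted polynomial."""
    return sum(
        (y - sum(c * x ** k for k, c in enumerate(coeffs))) ** 2
        for x, y in zip(xs, ys)
    )


def elasticity(parabola_coeffs, log_x):
    """Slope of the logarithmic parabola, b + 2c log X, at a given log X."""
    _, b, c = parabola_coeffs
    return b + 2 * c * log_x


line = fit_polynomial(X, E, 1)       # [a, b]
parabola = fit_polynomial(X, E, 2)   # [a, b, c]

print("line:", [round(v, 4) for v in line], "RSS", round(rss(line, X, E), 5))
print("parabola:", [round(v, 4) for v in parabola],
      "RSS", round(rss(parabola, X, E), 5))
```

A negative value of c would correspond to an elasticity that declines as income rises; whether the reduction in the residual sum is large enough to matter is a question for the methods of Chapter 17.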

A comparison may also be made in terms of the original arithmetic
values, using the estimated values of the fitted line transformed from
logarithms back to natural numbers. The right-hand section of Figure
6.12 shows this comparison. The fitted line is now seen to be a slight
curve, with a reasonably close “fit” to the actual observations.

An alternative to plotting the logarithms is to plot the variables in
natural numbers on double log paper. Use of the graphic method on
double log paper, or of the values plotted in logarithms, facilitates study
of changing elasticity along the demand curve.

The wheat-protein example, on the other hand, illustrated a case where
there was no logical basis for the use of any particular equation and where
a freehand curve was therefore as satisfactory as any other type and gave
a better fit than any of the analytical types which were tried. Many of
the problems in the natural and social sciences are of this type, where
the relation can be measured even though the specific causes for it cannot
be stated mathematically. Only where the relations can be explained
on some logical basis which lends itself to mathematical statement is
there justification for a large amount of work to “fit” a specific formula,
or where it is desired to determine the confidence intervals for the fitted
curve.*

Limitations in Estimating One Variable from 
Known Values of Another 

The methods shown so far provide a definite technique by which an
investigator can determine the way in which the values of one variable
differ as the values of another related variable differ. These same
operations afford a basis for estimating values of the dependent variable from
given values of the independent variable, for cases in addition to those
from which the functional relation was determined. Whether such
estimated values, for cases not included in the original study, can be expected
to agree with the true values if they could be determined, depends upon
two groups of considerations: (1) the descriptive value of the curve,
and (2) its representative reliability when it comes to applying it to new
observations.

* For methods of calculating such confidence intervals, see Chapter 17.




These two groups of considerations apply (1) to exactly what a given
curve means, with regard solely to the particular cases from which it
was determined; and (2) to the dependability of the curve with regard both
to the ability of those observations to represent the universe (whole
group of facts) from which they were drawn and the ability of the curve
to represent the true relations existing in that universe. This second group
involves an extension of the points which were raised in the first chapter
as to the reliability of an average; discussion of these questions will
be deferred until Chapters 17 and 19.

Just as an average computed from a sample may differ more or less 
widely from the true average of the universe from which that sample 
was drawn, so a regression line or curve determined from a sample may 
differ more or less widely from the true regression in the universe. The 
following chapter discusses this problem, and Chapter 17 presents methods 
of estimating how far the regression line or curve from an individual 
sample may miss the true regression of the universe, and how the confidence
interval for its position may be determined.^® 

The reliability of a curve depends upon the number of observations
from which its position was determined and how closely the curve as determined
“fits” those observations. Since the number of observations usually
differs along the different portions of a curve, it may be much more
reliable in its central portions, where the bulk of observations occurs,
than in the extreme portions where the number of observations may be
much less. This may be especially marked in the case of complex curves
fitted by mathematical means, where single extreme observations may
have a material effect upon the shape of the end portions. In any event,
only those portions of the curve where there are enough observations to
make its shape and position definite should be regarded as statistically
determined; the end portions, when dependent upon a few observations,
should either not be used at all or else stated as very rough indications
of the true curve.

It is particularly to be noted that determination of the line or curve of
relationship gives no basis for estimating beyond the limits of the values
of the independent variable actually observed. No matter whether a
formula has been fitted or not, any attempt to make estimates beyond the
range of the original data by “extrapolation,” i.e., by extending the curve
beyond the range of the observed data, gives a result that is not based on
the statistical evidence. In case a formula has been used which has a
good logical basis, extrapolation may give a result which it is logical

” For the straight line, e.g., (5.1), both the level (a) and slope (b) may vary between
samples drawn from the same universe. The word “position” is used here to designate
the combined effects of variations in a and b.




to expect, but its reasonableness rests on the validity of the logic rather
than on a statistical basis. The statistical analysis indicates only what the
relations are within the range of the observations which are used in the
analysis, and only within the confidence interval for the relation determined.

The “closeness” with which the line or curve fits the original data is
another criterion of the reliance which can be placed in it. If the data
all fall quite close to the line, that fact inspires more confidence in it than
if they differ widely and erratically from it. But there are special statistical
measures of just what this “closeness” is, and they will be given separate
consideration in the next chapter.
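The caution against extrapolation can be made concrete with a small sketch. The observations below are hypothetical, and the helper `estimate_within_range` is a name introduced only for this illustration; it simply declines to give an estimate for any value of the independent variable outside the range actually observed.

```python
# Hypothetical observations (not from the text).
xs = [2.0, 4.0, 6.0, 8.0, 10.0]
ys = [1.1, 1.9, 3.2, 3.8, 5.0]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares straight line Y = a + bX fitted to the observed cases.
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
a = mean_y - b * mean_x


def estimate_within_range(x):
    """Return a + b*x, but only for x inside the observed range of X."""
    lo, hi = min(xs), max(xs)
    if not lo <= x <= hi:
        raise ValueError(
            f"x={x} lies outside the observed range [{lo}, {hi}]; "
            "an estimate there would be extrapolation"
        )
    return a + b * x


print(round(estimate_within_range(5.0), 3))
```

Whether an estimate outside the range is nevertheless reasonable is a question for the logic behind the equation, not for the fitted constants; the guard only marks where the statistical evidence stops.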


Summary 

In some functional relations, the change in the dependent variable with
changes in the independent variable cannot be represented by a straight
line. Such a relation may be represented by a curve showing the value of the
dependent variable for each particular value of the independent variable.
Curves may be fitted to given sets of observations either by use of mathematical
functions, such as parabolas, logarithmic curves, and hyperbolas,
or by various processes of freehand smoothing. When there is a good
logical basis for the selection of a particular equation, the equation and the
corresponding curve can provide a definite logical measurement of the
nature of the relationship. When no such logical basis can be developed,
a curve fitted by a definite equation yields only an empirical statement
of the relationship and may fail to show the true relation. In such cases
a curve fitted freehand by graphic methods, and conforming to logical
limitations on its shape, may be even more valuable as a description of the
facts of the relationship than a definite equation and corresponding curve
selected empirically, but fitting less well.

In any event, estimates of the probable value of the dependent variable
cannot be made with any degree of accuracy for values of the independent
variable beyond the limits of the cases observed, and can be made most
accurately only within the range where a considerable number of observations
is available. It may be possible to extrapolate the curve if its equation
is based on a logical analysis of the relation as well as on the cases observed;
but in that case the logical analysis, and not the statistical examination,
must bear the responsibility for the validity of the procedure.


Note 6.1. When an equation is used with the dependent variable stated as a
logarithm, as in types (A) and (<•) on page 70, the further assumption is involved that the
errors to be minimized vary proportionately with the size of the dependent variable.
The standard error of estimate also must be stated as a percentage of the value estimated,
rather than as a natural number. For an example of a problem where the range of error
increases with the size of the dependent variable, and where a logarithmic equation
would therefore be justified, see Figure 8.2, page 142.

REFERENCES 

Mills, Frederick C., The measurement of correlation and the problem of estimation,
Quart. Pub. Amer. Stat. Assoc., Vol. XIX, pp. 273-300, September, 1924.

Ezekiel, Mordecai, A method of handling curvilinear correlation for any number of
variables, Quart. Pub. Amer. Stat. Assoc., Vol. XIX, pp. 431-453, December, 1924.



CHAPTER 7 


Measuring accuracy of estimate 
and degree of correlation 


The methods developed up to this point may be used to estimate the
values of one variable when the values of another are known or given.
They also furnish an explicit statement of the average difference or change
in the values of the estimated or dependent variable for each particular
difference or change in the value of the known or independent variable.
But that is not enough. In addition it is frequently desirable to answer
three queries: (1) How closely can values of the dependent variable be
estimated from the values of the independent variable? (2) How important
is the relation of the dependent variable to the independent variable?
(3) How far are the regression curve and these relations, as shown by the
particular sample, likely to depart from the true values for the universe
from which the sample was drawn? Special statistical devices, termed
(1) the standard error of estimate and (2) the coefficient and index of
correlation, have been developed to meet the need indicated by the first two
questions. Error formulas and knowledge of the distributions of these
coefficients, and standard errors for the regression line or curve, provide
approximate answers for the third, under the assumption that certain
conditions of sampling are met.


The Closeness of Estimate — Standard Error of Estimate 

Attention has previously been called to the fact that when some
dependent variable, such as the distance required for an automobile to
stop after the brake is applied or the protein content in wheat samples,
is estimated from another variable, such as the speed at which the car
is moving or the proportion of vitreous kernels in the sample, the estimated
values in many cases will not be the same as the values of the dependent
variable that were originally observed. These differences are obviously
due to residual causes; that is, to variations in the dependent variable
which were unrelated to changes in the particular independent variable
used in the analysis. For that reason the differences between the estimated
values and the actual values are termed residual differences or, more
simply, residuals.

For Linear Relations. The meaning of the residuals and their use in
determining the standard error of estimate and the coefficient and index
of correlation can best be understood if illustrated by a concrete case.
Such an illustration is given in Table 7.1. Each of the 22 observations
relates to a subregion of several counties drawn at random from a larger
number of subregions in the North Central States. One variable (X)
measures the degree of industrialization in terms of the per cent of all
employed persons in the subregion as of 1940 who were engaged in
manufacturing. The other variable (Y) measures the effect of migration
during 1940-1950 upon the population of the subregion. During this
decade employment in manufacturing expanded rapidly and many people
migrated from rural areas to industrial centers.

In Table 7.1 the observations have been fitted by a straight line to
estimate net migration on the basis of manufacturing employment. The
estimated net migration figures, Y', and the residuals, z, or differences
between the estimate and the actual, are also shown.
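As a check on how the columns of Table 7.1 are built, the following sketch recomputes the estimate Y' and the residual z for the first five subregions, using the regression formula given in the footnote to the table. The variable names are introduced here only for illustration.

```python
# Regression formula given in the footnote to Table 7.1: Y' = A + B * X.
A, B = -17.918, 0.6877

# First five (X, Y) pairs from Table 7.1.
observations = [(28.1, -6.9), (5.1, -29.2), (5.9, -7.5), (29.4, 7.9), (41.9, 8.3)]

rows = []
for x, y in observations:
    y_est = A + B * x   # estimated net migration, Y'
    z = y - y_est       # residual: excess of actual over estimate
    rows.append((x, y, round(y_est, 2), round(z, 2)))

for row in rows:
    print(row)
```

The rounded values agree with the first five rows of the table.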

The residuals vary from +13.33 to -14.79. If we wish to say how large
they are on the average, we can ignore the plus and minus signs and
compute the average deviation. For the 22 residuals in Table 7.1, the
average deviation is 4.08, and the standard deviation is 5.60. If these
residuals are grouped in a frequency distribution, they fall as shown in
Table 7.2.
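The two summary figures can be recomputed from the residual column of Table 7.1. The sketch below assumes the unadjusted standard deviation is the population form (dividing by n), which is what reproduces the 5.60 quoted above.

```python
import statistics

# Residuals z (actual minus estimate) from Table 7.1.
z = [-8.31, -14.79, 6.36, 5.60, -2.60, -0.74, 1.02, 6.06, 3.25, -3.42, 1.64,
     -0.57, -4.16, -4.98, 1.39, 4.79, -0.50, -3.15, 1.16, 13.33, -1.65, 0.27]

# Average deviation: mean of the residuals with their signs ignored.
average_deviation = sum(abs(v) for v in z) / len(z)

# Standard deviation, dividing by n (no adjustment for sample size yet).
standard_deviation = statistics.pstdev(z)

print(round(average_deviation, 2), round(standard_deviation, 2))
```

Both results round to the figures in the text, 4.08 and 5.60.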

The standard deviation of z is different from the standard deviations
previously computed. Instead of showing the standard deviation of net
migration from its mean (that is, $s_y$), it shows the standard deviation around
a changing quantity, depending on the per cent of employment in manufacturing.
The $s_z$ is thus the standard deviation around the fitted line
of relation for the sample we have analyzed, and may be indicated graphically
on a dot chart as a certain distance above and below the fitted line
(note Figure 8.1, in the next chapter).

The standard deviation of the residuals is 5.60, so we should expect
two-thirds of the residuals to come between +5.60 and -5.60. Of the
22 cases, 16 came within this range of the line, or 73 per cent of all the
cases. Similarly, only 5 per cent of the cases would be expected to fall
outside the range $\pm 2s_z$, or below -11.20 or above +11.20. Actually
two observations, or 9 per cent of the cases, fall outside this range. These





Table 7.1

Employment in Manufacturing, 1940, Net Migration, 1940-1950, and
Net Migration Estimated from Employment in Manufacturing,
in Selected Subregions

Employment in     Net             Estimated Net     Excess of Actual
Manufacturing,    Migration,      Migration,*       over Estimate,
X                 Y               Y'                z
(Per cent†)       (Per cent‡)     (Per cent§)       (Per cent§)

    28.1             -6.9              1.41             -8.31
     5.1            -29.2            -14.41            -14.79
     5.9             -7.5            -13.86              6.36
    29.4              7.9              2.30              5.60
    41.9              8.3             10.90             -2.60
    18.7             -5.8             -5.06             -0.74
     9.3            -10.5            -11.52              1.02
    18.7              1.0             -5.06              6.06
    24.1              1.9             -1.35              3.25
    17.5             -9.3             -5.88             -3.42
    18.0             -3.9             -5.54              1.64
     9.0            -12.3            -11.73             -0.57
     8.7            -16.1            -11.94             -4.16
    14.4            -13.0             -8.02             -4.98
     4.7            -13.3            -14.69              1.39
     7.6             -7.9            -12.69              4.79
     4.1            -15.6            -15.10             -0.50
     1.7            -19.9            -16.75             -3.15
     2.7            -14.9            -16.06              1.16
    10.3              2.5            -10.83             13.33
     3.3            -17.3            -15.65             -1.65
     4.0            -14.9            -15.17              0.27

Source of data: Paul J. Jehlik and Ray E. Wakeley, Population change and net
migration in the North Central States, 1940-50, Iowa Agricultural Experiment Station,
Research Bulletin 430, July 1955.
* Computed by regression formula Y' = -17.918 + 0.6877X.
† Per cent of all employed persons in the subregion as of 1940 who were engaged in
manufacturing.
‡ Per cent change in population of the subregion through migration, 1940-1950.
§ Same units of measure as for Y.








Table 7.2

Frequency Distribution of Residuals in Estimating Net Migration

Residual               Number of Times Occurring

-14.99 to -10.00                  1
 -9.99 to  -5.00                  1
 -4.99 to   0                     9
  0    to   4.99                  7
  5.00 to   9.99                  3
 10.00 to  14.99                  1


are sufficiently close to the expected proportions for a normal distribution 
with this limited number of observations. 
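The counts quoted above and the frequency distribution of Table 7.2 can be reproduced from the residuals of Table 7.1. In this sketch the class limits are treated as half-open intervals five points wide; that is a simplification made here, harmless for these data because no residual falls exactly on a class boundary.

```python
import statistics

# Residuals z (actual minus estimate) from Table 7.1.
z = [-8.31, -14.79, 6.36, 5.60, -2.60, -0.74, 1.02, 6.06, 3.25, -3.42, 1.64,
     -0.57, -4.16, -4.98, 1.39, 4.79, -0.50, -3.15, 1.16, 13.33, -1.65, 0.27]

s = statistics.pstdev(z)  # standard deviation of the residuals

# Cases within one standard deviation of the line, and beyond two.
within_one_s = sum(1 for v in z if abs(v) <= s)
outside_two_s = sum(1 for v in z if abs(v) > 2 * s)

# Frequency distribution in classes five points wide, as in Table 7.2.
edges = [-15, -10, -5, 0, 5, 10, 15]
counts = [sum(1 for v in z if lo <= v < hi) for lo, hi in zip(edges, edges[1:])]

print(within_one_s, round(100 * within_one_s / len(z)))    # cases and per cent
print(outside_two_s, round(100 * outside_two_s / len(z)))
print(counts)
```

For a normal distribution roughly two-thirds of the cases would lie within one standard deviation and about 5 per cent beyond two, which is the comparison the text draws.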

The symbol S is used to denote the standard error of estimate. $S_{y.x}$
indicates the standard error for estimates of Y made from a linear relation
to X by the equation Y = a + bX. Similarly, $S_{y.f(x)}$ would indicate the
standard error for estimates of Y made on the basis of a freehand curve
relation to X, as indicated by the equation Y = f(X).

The standard error of estimate is therefore defined by the two equations:

$$S_{y.x}^2 = s_z^2 \tag{7.1}$$

$$S_{y.f(x)}^2 = s_{z''}^2 \tag{7.2}$$


The standard error of estimate in estimating net migration from employ- 
ment in manufacturing, by the linear equation, is therefore 5.60 per cent. 

For Curvilinear Relations. Where a curvilinear regression is represented
by a fitted algebraic equation, as with the cubic parabola fitted to the
wheat-protein example in the preceding chapter, the standard error of
estimate can be calculated from the residuals between the actual values of
Y and the estimated values, Y'', based on the fitted equation. (Methods of
calculating without actually computing each individual value of Y'' and
z'' are given later, in Chapters 12 and 13.)

The calculation of the standard error of estimate for a freehand
regression curve may be illustrated by the migration data. From a freehand
curve, fitted by methods already described, estimates of Y from the relation
Y = f(X) were obtained, as shown in Table 7.3.

The standard deviation of the new residuals is 5.49. This is then the 
standard error of estimate for estimates based on the curve. 
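The figure of 5.49 can be recomputed from the residual column of Table 7.3. As before, this sketch assumes the population form of the standard deviation (dividing by n). It also prints the mean residual, since residuals from a freehand curve need not average exactly zero as those from a least-squares line do.

```python
import statistics

# Residuals z'' (actual minus freehand-curve estimate) from Table 7.3.
z_curve = [-8.8, -14.4, 6.5, 5.5, 1.1, -2.6, 0.0, 4.2, 1.9, -5.1, 0.0,
           -1.3, -4.8, -6.6, 2.0, 4.3, 0.1, -0.6, 2.7, 12.3, -0.3, 1.1]

s_curve = statistics.pstdev(z_curve)         # standard deviation, dividing by n
mean_residual = sum(z_curve) / len(z_curve)  # small, but not exactly zero

print(round(s_curve, 2), round(mean_residual, 3))
```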




Table 7.3

Employment in Manufacturing, Net Migration, and Net Migration
Estimated from Employment in Manufacturing, by Freehand Curve

Employment in     Net             Estimated Net     Excess of Actual
Manufacturing,    Migration,      Migration,        over Estimate,
X                 Y               Y''               z''
(Per cent)        (Per cent)      (Per cent)        (Per cent)

    28.1             -6.9              1.9              -8.8
     5.1            -29.2            -14.8             -14.4
     5.9             -7.5            -14.0               6.5
    29.4              7.9              2.4               5.5
    41.9              8.3              7.2               1.1
    18.7             -5.8             -3.2              -2.6
     9.3            -10.5            -10.5               0
    18.7              1.0             -3.2               4.2
    24.1              1.9              0                 1.9
    17.5             -9.3             -4.2              -5.1
    18.0             -3.9             -3.9               0
     9.0            -12.3            -11.0              -1.3
     8.7            -16.1            -11.3              -4.8
    14.4            -13.0             -6.4              -6.6
     4.7            -13.3            -15.3               2.0
     7.6             -7.9            -12.2               4.3
     4.1            -15.6            -15.7               0.1
     1.7            -19.9            -19.3              -0.6
     2.7            -14.9            -17.6               2.7
    10.3              2.5             -9.8              12.3
     3.3            -17.3            -17.0              -0.3
     4.0            -14.9            -16.0               1.1


The standard error of estimate of 5.49 from the curve, compared with
that of 5.60 from the straight line, indicates that in both cases the net
migration from a subregion can be estimated, for the cases included in
the sample, from the per cent of employment in manufacturing with a
standard deviation of about 5½ percentage points. It appears at this
stage that the estimates made on the basis of the curvilinear relation are
only a little more reliable than those based on the linear relation.
Where the same set of conditions prevails as those under which the
original data were selected and only the independent variable is known,
it may be desired to estimate the probable value of the dependent variable
from the known value of the independent. Thus if the 1940 percentages
of employment in other subregions are known, it may be desired to
estimate the probable extent of net migration during 1940-1950. To
actually measure net migration for these other subregions would require
detailed and laborious calculations. Or in a case where yield of cotton
with various applications of irrigation water has been determined (note
the example in the next chapter) it may be desired to estimate the most
probable yield on other fields, solely from the amount of water applied.
In case the estimates were to be made for new observations taken from
the same universe (for example, on the same soil type, in the same area,
and for the same year) as were the previous samples, a knowledge of the
standard deviation of the residuals for original samples gives a basis
for judging how closely the new estimates are likely to approximate the
true, but unknown, yields for the new observations. Similarly, in the
net migration case it is evident that the errors of estimate will not often
be greater than 11.20 per cent and usually will be less than 5.60 per cent.

Because the standard deviation of the residuals may thus serve as a
basis for indicating the closeness with which new estimated values may
be expected to approximate the true but unknown values, it has been
named the standard error of estimate.¹

The standard error of estimate can be used to indicate the probable
reliability of a series of estimates of the values of the dependent variable
for new observations when only the values of the independent variable
are known, but only where it is definitely known that the new cases are
drawn at random from exactly the same universe (the same set of conditions)
as were the observations from which the relation was determined.
In case they do not represent exactly the same conditions, as if, for
example, they represent a different period of time,² then the standard
error of estimate has meaning only with respect to the scatter of the
residuals around the regression line for the cases used in determining the
relationship. It measures (when adjusted) what the differences probably
would have been in the universe from which the observations came but
does not give more than a clue or a possible indication as to what the
differences may be when the same relations are applied to data obtained
under new or different conditions.

Adjustment of Standard Error of Estimate for the Number of
Observations. The standard deviations of a series of samples drawn
from any stable universe will vary from one to another, owing to statistical

¹ Chapter 19 gives more refined measures of the accuracy with which estimates may
be made for individual new observations.

² See Chapters 2 and 17 for other conditions assumed before error formulas apply
exactly.





fluctuations. The same is true for the standard error of estimate computed
for a fitted line. The standard deviations and standard errors of
estimate not only vary but on the average also are slightly smaller
than would be obtained from an extremely large sample from the
same universe. Because of this tendency of the standard error of estimate
from the sample to understate the standard error in the universe, an
adjustment is necessary before it can be used other than for the sample.
An unbiased estimate of the value of the square of the standard error of
estimate for the entire universe may be calculated from the standard error
of estimate for the sample by the use of the following equations:



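$$\bar{S}_{y.x}^2 = \frac{n s_z^2}{n - m} \tag{7.3}$$

hence

$$\bar{S}_{y.x} = \sqrt{\frac{n s_z^2}{n - m}} \tag{7.4}$$

And for curvilinear functions

$$\bar{S}_{y.f(x)}^2 = \frac{n s_{z''}^2}{n - m} \tag{7.5}$$

$$\bar{S}_{y.f(x)} = \sqrt{\frac{n s_{z''}^2}{n - m}} \tag{7.6}$$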


In these equations, $\bar{S}^2$ is used to indicate the estimated squared standard
error of estimate for the universe, just as $\bar{s}$ was used (in Chapter 2) to
indicate the estimated standard deviation in the universe from which the
sample was drawn.

In equations (7.3) to (7.6), n stands for the number of observations.
In all four equations, m stands for the number of constants in the regression
equation, such as a, b, and c. In the case of a parabola of the second order
(type a, Chapter 6), m would be 3; for a cubic parabola (type /), it would
be 4. Where a freehand curve has been used, it is necessary to estimate
how many constants would be needed to represent the curve mathematically.
(See pages 70-72 for examples of the constants needed to represent
various shapes of curves.)

The standard error of estimate in estimating net migration by the
linear equation, after the standard deviation of the residuals is adjusted
by equation (7.3), works out to be

$$\bar{S}_{y.x}^2 = \frac{22(5.60)^2}{22 - 2} = 34.50$$

$$\bar{S}_{y.x} = 5.87$$





The new value indicates that the errors in estimating net migration from
employment in manufacturing, when the estimate is made for new observations
drawn at random from the same universe, will run slightly larger
than was indicated by the residuals for the cases included in the study,
as tabulated in Table 7.1.

When the standard deviation for the curvilinear function is calculated
by equation (7.5), a different result appears. If it is assumed that the
regression curve used could have been represented mathematically by
an equation with three constants (such as a parabola) then the correction
works out to be:

$$\bar{S}_{y.f(x)}^2 = \frac{22(5.49)^2}{22 - 3} = 34.90$$

$$\bar{S}_{y.f(x)} = 5.91$$


The adjusted standard error of estimate for the curvilinear relation, 
5.91, is slightly larger than that for the linear equation, 5.87. This 
indicates that when estimates are made for new observations from the 
same universe, the straight line is likely to give fully as reliable results as 
is the regression curve. Not unless the adjusted standard error for the 
curve is materially smaller than for the straight line can the curvilinear 
regression be expected to improve the accuracy of estimate.³
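The adjustment for the number of observations and constants is simple enough to sketch directly. The function name `adjusted_standard_error` is introduced here for illustration; the value m = 3 for the freehand curve follows the assumption made in the text that the curve could have been represented by an equation with three constants.

```python
import math


def adjusted_standard_error(s, n, m):
    """Adjust a sample standard error s for n observations and m constants.

    Implements S-bar = sqrt(n * s**2 / (n - m)).
    """
    return math.sqrt(n * s ** 2 / (n - m))


linear = adjusted_standard_error(5.60, 22, 2)  # straight line: constants a and b
curve = adjusted_standard_error(5.49, 22, 3)   # freehand curve, assumed 3 constants

print(round(linear, 2), round(curve, 2))
```

Although the curve has the smaller unadjusted standard error, the extra constant charged against it reverses the order after adjustment, which is the point of the comparison above.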

Units of Statement for Standard Error of Estimate. The standard
error of estimate is necessarily stated in exactly the same kind of units
as the original dependent variable. Where the dependent variable is
stated in feet, as in the automobile problem, the standard error of estimate
will be in feet; where it is in percentage points, as in the wheat problem,
the standard error will be in percentage points; and where it is in logarithms,
as in Table 6.11, the standard error will be in logarithms. Thus in
a case like that shown there, the standard error might be the logarithm
0.038. That would mean that the logarithm of the estimates is likely to
agree with the logarithm of the true values to within ±0.038, two-thirds of
³ The values of $\bar{S}_{y.x}$ are subject to errors of sampling, just as the values of $\bar{s}_y$ are
subject to errors of sampling. Accordingly, the values of $\bar{S}_{y.x}$ must be regarded only as
estimates of the true values, $\sigma_{y.x}$, which prevail in the universe from which the sample is
drawn. Also, it must be remembered that the adjustment, m, for the number of degrees
of freedom removed, is only an approximate adjustment in the case of a freehand curve,
and that this introduces a further limitation to the accuracy of $\bar{S}_{y.f(x)}$. Even for relations
estimated from an algebraic equation, the accuracy of estimates for new observations
will vary somewhat from one observation to another, depending on how unusual is the
value of the independent variable. See Chapter 19 for fuller discussion.




the time. With an estimated logarithm of 1.00, the logarithm of the true
value would then be between 0.962 and 1.038, two-thirds of the time.
In terms of antilogarithms, this gives values of 9.16 and 10.91, or between
9.1 per cent above and 8.4 per cent below the value 10.
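The conversion from a standard error in logarithms to percentages of the estimated value can be verified in a few lines. Base-10 logarithms are assumed, consistent with an estimated logarithm of 1.00 corresponding to the value 10.

```python
log_estimate = 1.00  # estimated logarithm (base 10)
log_se = 0.038       # standard error of estimate, stated in logarithms

estimate = 10 ** log_estimate
lower = 10 ** (log_estimate - log_se)  # antilogarithm of 0.962
upper = 10 ** (log_estimate + log_se)  # antilogarithm of 1.038

pct_below = 100 * (estimate - lower) / estimate
pct_above = 100 * (upper - estimate) / estimate

print(round(lower, 2), round(upper, 2))
print(round(pct_above, 1), round(pct_below, 1))
```

The limits are not symmetrical in natural numbers: an equal distance above and below in logarithms is a larger percentage above than below.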

The standard error of estimate is thus computed from the standard
deviation of the residuals for the cases on which the relation is based.
It indicates the closeness with which values of the dependent variable
may be estimated from values of the independent variable. Its exact
interpretation differs with the particular units in which the values of the
dependent variable are expressed.

The Relative Importance of the Relationship — Correlation 

In certain problems it might be found that every bit of variation in one
variable could be explained, or accounted for, by associated differences
in the value of an accompanying variable. Thus all the variation in the
volume of a cube can be explained by the corresponding difference in
the length of one side. No other variable is needed to account for the
volume of the cube. If we know what the length of the side is, we can
compute accurately what the volume will be. All the variation in volume
can therefore be said to be explained, or accounted for, by the known
relation to the length of the side.

In most problems with which the statistician has to deal, however, all
the variation cannot be explained by the relation to another variable,
and residual variation is left over. As has just been pointed out, this
residual variation can be measured and used as an indication of the errors
of estimate.

It is obvious that if no relation has been found, the independent variable
considered does not explain any of the observed variation in the dependent
variable, and so none of the variation can be explained as due to, or
associated with, the independent variable. If, as in the case of the cube,
the estimates all agree exactly with the actual values, there are no residual
elements, and the variation is perfectly explained. But between these
two extremes lie the cases of partial explanation, where a portion of the
variation can be explained by the independent variable considered, and
a portion cannot. In the automobile case, part of the variation in stopping
distance, but not all, was associated with the speed; in the wheat case,
part of the variation in protein content, but not all, could be estimated
from variations in the proportion of vitreous kernels; and in the migration
case, part of the variation in net migration, but not all, could be accounted
for by variations in manufacturing employment. In many problems it is
of interest to determine what proportion of the variation in the dependent
variable can be explained by the particular independent variable considered,
according to the relation observed.

Measurement of the relative importance of the relation between two
variables calls for a different type of statistical constant than the standard
error of estimate. The standard error of estimate simply indicates the
size of the residuals without regard to the amount of variation in the
dependent variable as first observed. If the standard error of estimate
for a cotton-yield problem, for example, were 50 pounds, that would be
the standard error no matter whether the yield of cotton in the original
cases varied only between 200 and 400 pounds or between 50 and 1,200.
If the yields varied only between 200 and 400 pounds, and the standard
error was 50, practically all the variation in the original yields would
still be left in the residuals; whereas if the yields varied between 200 and
1,200 and the standard error was 50, only a very small portion of the
original variation would be left in the residuals. Yet the standard error
of estimate would be of the same size in both cases.

What is needed to indicate the relative importance of the relationship
is some measure that shows what proportion of the original variation has
been accounted for. The regression line separates each original value of
the dependent variable into two parts, an estimated value (Y') and a
residual (z). If the original variation in Y is measured by its standard
deviation squared (its “variance” as defined on page 10), and the
unexplained variation by $s_z^2$, then the difference $s_y^2 - s_z^2$ is a logical measure
of the amount of variation accounted for by the regression line.

As it turns out, this difference is exactly equal to the variance of the
Y' values estimated from the regression line. Thus, in the migration
example, $s_y^2 = 81.23$, $s_z^2 = 31.36$, and $s_{y'}^2$, calculated from the Y' values
in Table 7.1, is 49.91. If we determine how large $s_{y'}^2$ is compared to the
original variance, we get $s_{y'}^2/s_y^2 = 49.91/81.23$, or 0.614. This is the proportion
of variation in Y accounted for by X according to the mathematically
fitted straight-line relationship. The square root of this
proportion, $s_{y'}/s_y$, is termed the coefficient of correlation.
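The separation of the variance of Y into an explained and an unexplained part can be checked from the X and Y columns of Table 7.1. The sketch below refits the least-squares line from the data (rather than taking the rounded constants from the table footnote) and uses population variances, dividing by n; both choices are assumptions of this illustration.

```python
import statistics

# X (per cent employed in manufacturing) and Y (net migration), Table 7.1.
x = [28.1, 5.1, 5.9, 29.4, 41.9, 18.7, 9.3, 18.7, 24.1, 17.5, 18.0,
     9.0, 8.7, 14.4, 4.7, 7.6, 4.1, 1.7, 2.7, 10.3, 3.3, 4.0]
y = [-6.9, -29.2, -7.5, 7.9, 8.3, -5.8, -10.5, 1.0, 1.9, -9.3, -3.9,
     -12.3, -16.1, -13.0, -13.3, -7.9, -15.6, -19.9, -14.9, 2.5, -17.3, -14.9]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Least-squares straight line Y' = a + bX.
b = sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y)) / sum(
    (u - mean_x) ** 2 for u in x
)
a = mean_y - b * mean_x

y_est = [a + b * u for u in x]           # estimated values, Y'
z = [v - e for v, e in zip(y, y_est)]    # residuals

var_y = statistics.pvariance(y)          # original variance of Y
var_est = statistics.pvariance(y_est)    # variance of the estimates
var_z = statistics.pvariance(z)          # unexplained (residual) variance

proportion = var_est / var_y             # share of variance accounted for

print(round(a, 3), round(b, 4))
print(round(var_y, 2), round(var_est, 2), round(var_z, 2))
print(round(proportion, 3))
```

Computed from unrounded values, the residual variance comes out slightly below the 31.36 of the text, which is the square of the rounded 5.60; the explained and unexplained parts then add to the original variance exactly.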

Linear Relations — Coefficients of Correlation and Determination. 
The symbol r is used to represent the coefficient of correlation where the
relationship between the two variables is found or assumed to be a straight
line. When values of Y are estimated from values of X according to the
straight-line equation, the coefficient of correlation is indicated by the
notation $r_{yx}$, which is read “the coefficient of correlation between Y and X.”

The coefficient of correlation may therefore be defined

$$r_{yx} = \frac{s_{y'}}{s_y} \tag{7.7}$$



In this particular case, $r_{yx}$ = 7.06/9.01 = 0.784.

This formula gives values of r identical with those given by the more
usual formula, equation (8.3), presented in the next chapter, as can be
proved by simple algebra.⁴
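The agreement between the two ways of computing r can be illustrated numerically on the migration data. Equation (8.3) itself is not reproduced in this section; the sketch assumes that the “more usual formula” is the ordinary product-moment form, and compares it with the ratio of standard deviations defined in equation (7.7).

```python
import math
import statistics

# X and Y columns of Table 7.1.
x = [28.1, 5.1, 5.9, 29.4, 41.9, 18.7, 9.3, 18.7, 24.1, 17.5, 18.0,
     9.0, 8.7, 14.4, 4.7, 7.6, 4.1, 1.7, 2.7, 10.3, 3.3, 4.0]
y = [-6.9, -29.2, -7.5, 7.9, 8.3, -5.8, -10.5, 1.0, 1.9, -9.3, -3.9,
     -12.3, -16.1, -13.0, -13.3, -7.9, -15.6, -19.9, -14.9, 2.5, -17.3, -14.9]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y))
sxx = sum((u - mean_x) ** 2 for u in x)
syy = sum((v - mean_y) ** 2 for v in y)

# Ratio form, equation (7.7): standard deviation of Y' over that of Y.
b = sxy / sxx
a = mean_y - b * mean_x
y_est = [a + b * u for u in x]
r_ratio = statistics.pstdev(y_est) / statistics.pstdev(y)

# Product-moment form (assumed here to correspond to equation (8.3)).
r_product_moment = sxy / math.sqrt(sxx * syy)

print(round(r_ratio, 3), round(r_product_moment, 3))
```

The ratio form is always positive, so the two agree as they stand only when the slope b is positive, as it is here; for a negative slope the sign of b would have to be attached to the ratio.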

Curvilinear Relations — Index of Correlation. In case the relation 
has been determined as a curvilinear function instead of a straight line, 
the ratio $s_{y''}/s_y$ is termed the index of correlation, and is represented by the
symbol $i_{yx}$.

The index of correlation may therefore be approximately defined as

$$i_{yx} = \frac{s_{y''}}{s_y}$$

Computing the index of correlation for the migration case, $i_{yx}$ =
7.15/9.01 = 0.794. From this figure, it would appear that the correlation
is slightly higher for the curve than for the straight line.
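The text gives only the ratio 7.15/9.01 for the index. One reading that reproduces the 7.15, and it is an assumption of this sketch rather than something stated in the text, is that the variance accounted for by the curve is taken as the original variance of Y less the variance of the residuals from the curve.

```python
import math

var_y = 81.23      # variance of Y, as given in the text
s_z_curve = 5.49   # standard deviation of residuals from the freehand curve

# Assumed here: explained variance = original variance - residual variance.
var_explained = var_y - s_z_curve ** 2
s_est_curve = math.sqrt(var_explained)

index_of_correlation = s_est_curve / math.sqrt(var_y)

print(round(s_est_curve, 2), round(index_of_correlation, 3))
```

Carried through without intermediate rounding, this gives an index within about 0.001 of the 0.794 in the text, which is obtained from the rounded values 7.15 and 9.01.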

Characteristics of the Measures of Correlation. It should be noted
that in the case of straight-line relations, if the line has a positive slope,
so that as X increases the values of Y' (the estimated values of Y) increase,
the correlation is said to be positive, and a plus sign is affixed to the correlation
coefficient. Similarly, if the line has a negative slope, so that as
the values of X (the independent variable) are larger, the values of Y'
(the estimated values for the dependent variable) become smaller, the
correlation is said to be negative, and a minus sign is affixed to the correlation
coefficient. The coefficient of correlation thus takes the same sign
as the constant b of the corresponding linear equation. In the case of the
correlation index, the slope may be positive in one portion and negative
in another, so no sign is used, and reference to the curve is necessary to
indicate the nature of the relationship.

In a case where the observed relation explains all the variation in the
dependent variable, the estimated values will be identical with the actual
values. The standard deviation of Y' will therefore be exactly as large
as the standard deviation of Y, and the ratio $s_{y'}/s_y$ will equal 1.0. This
is termed perfect correlation, and is indicated when i = 1.0, or when
r = +1.0 or -1.0.

M cV.KTTA of TiO sifaViOw, 7*0 -vaTfii’iYcm can 'vt acccfcinfci fcfi 

by the particular independent variable considered, and the estimated 
values Y' are therefore all the same, being merely the average of Y In 
that case the standard deviation of the estimated values is zero, and the 
ratio /s„ = 0/jy = 0 The case of complete absence of correlation, 
therefore, is indicated by values of zero for either r or i 
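The two limiting cases can be checked numerically. The following is only a minimal sketch in Python, using small hypothetical series invented for the purpose (they are not data from this book); the function names are likewise assumptions made here. It fits a least-squares line, forms the ratio of the standard deviation of the estimates to that of the actual values, and attaches the sign of the slope.

```python
import math

def std(values):
    """Standard deviation with divisor n, as used in this chapter."""
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

def fit_line(x, y):
    """Least-squares constants a and b for Y' = a + bX."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return my - b * mx, b

def signed_ratio(x, y):
    """Ratio s_y'/s_y, given the sign of the slope b."""
    a, b = fit_line(x, y)
    estimates = [a + b * xi for xi in x]
    return math.copysign(std(estimates) / std(y), b)

# Hypothetical illustrative series (not from the text).
x = [1, 2, 3, 4, 5]
perfect_positive = [3, 5, 7, 9, 11]   # Y = 1 + 2X exactly
perfect_negative = [10, 8, 6, 4, 2]   # Y = 12 - 2X exactly
unrelated = [4, 1, 6, 1, 4]           # chosen so the fitted slope is zero

r_pos = signed_ratio(x, perfect_positive)
r_neg = signed_ratio(x, perfect_negative)
r_none = signed_ratio(x, unrelated)
print(round(r_pos, 6), round(r_neg, 6), round(abs(r_none), 6))
```

The third series was constructed so that its products of deviations cancel; any series with that property would serve equally well.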

* The correlation observed in a sample is designated r, and ρ (Greek rho) is used to
represent the true correlation in the universe from which the sample was drawn.





Measuring Accuracy of Estimate and Degree of Correlation 

The possible values of the coefficient of correlation therefore range
from 0 to +1.0 or to -1.0, whereas the values for the index of correlation
range from 0 to 1.0. Since most problems with which the investigator
has to deal involve cases that are intermediate, where there is some but
not perfect correlation, it is these intermediate cases which are of most
importance. The precise significance of different values of r and i will
be considered next.

The correlation coefficient was originally defined in terms of the special 
situation in which Y and X each followed a normal distribution (see 
Chapter 1, Figure 1.1), and the universe of all possible paired values of 
Y and X formed what is called a “bivariate normal distribution.” If a 
completely random sample were drawn from such a universe, the coeffi- 
cient r calculated for that sample could be regarded as an estimate of the 
true correlation, ρ, existing in the universe. Similarly, the square of the
sample coefficient of correlation could be regarded as an estimate of the 
proportion of variation in Y associated with variations in X in the 
universe. 

Precisely the same arithmetic is involved in calculating r for pairs of 
observations in which the values of X have been chosen in some non- 
random fashion, such as in a controlled experiment. But the value of r 
is now strongly influenced by the way in which the X values are selected, 
being high if only extremely large and extremely small values of X are 
chosen and low if the chosen values of X are concentrated in a narrow 
range. (See fuller explanation in Chapter 17.) In such cases r is little
more than a description of one aspect of the particular set of observations
under study. It will be fairly stable from one experiment to another only
if the values of X are selected in the same way in every experiment. In
controlled experiments, therefore, some statisticians do not even bother 
to calculate r, but content themselves with the line of relationship, the 
standard error of estimate, and measures of the accuracy with which the 
level and slope of the regression line have been estimated. 
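The dependence of r on the way the X values are chosen can be illustrated with a small sketch. Everything in it is hypothetical: the relation Y = 10 + 2X, the fixed pattern of disturbances, and both sets of X values were invented here for illustration, and the same disturbances are applied to both sets so that only the selection of X differs.

```python
import math

def correlation(x, y):
    """Coefficient of correlation from sums of deviation products."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical underlying relation and disturbances (not from the text).
disturbances = [3, -3, -2, 2, 1, -1, -3, 3]

def respond(x_values):
    """Y = 10 + 2X plus the fixed disturbance attached to each case."""
    return [10 + 2 * x + e for x, e in zip(x_values, disturbances)]

# X concentrated in a narrow range versus X chosen at the extremes.
x_narrow = [4.6, 4.8, 5.0, 5.2, 4.7, 4.9, 5.1, 5.3]
x_extreme = [0, 0, 1, 1, 9, 9, 10, 10]

r_narrow = correlation(x_narrow, respond(x_narrow))
r_extreme = correlation(x_extreme, respond(x_extreme))
print(f"narrow range of X:  r = {r_narrow:.3f}")
print(f"extreme values of X: r = {r_extreme:.3f}")
```

With the same relation and the same disturbances, the correlation is low in the first case and high in the second, solely because of the spread of X.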

There are other situations in which the values of X are not controlled 
by the investigator but in which they can hardly be regarded as random 
drawings from a definite, stable universe. This may often be the case in 
economic time series; the correlation between a given series and, say, 
consumer income will typically be higher in a period characterized by
wide fluctuations in consumer income than in one during which income 
is relatively stable. As in the experimental case above, r is not likely to 
be stable from period to period unless the economic system happens to 
generate the same values of income as before or at least about the same 
total amount of variation in income. 

Finally, the observations on X and Y may be drawn at random from
some definite, stable universe that does not follow the bivariate normal
distribution. In some such cases it may happen that certain functions of
X and Y (such as their logarithms or reciprocals) do form a bivariate
normal distribution; if so, the coefficient r based on appropriately
transformed sample values may be regarded as an estimate of the true
correlation, ρ, in the universe of transformed values. There are other cases in
which the original values of X and Y, or transformed ones, are distributed
in something other than "normal" fashion; in these cases, r may still
be a rough estimate of correlation in the universe, but we can no longer
be sure that certain formulas appropriate to the normal distribution
still apply.

In contrast, the interpretation of the regression coefficient is the
same regardless of whether the X values are drawn at random or are
subjected to purposeful selection. For this reason many statisticians
now place primary emphasis upon regression and stress correlation only
when both X and Y values are drawn at random from a universe
approximating the bivariate normal form. In the latter case the regression of X
upon Y may under certain conditions be just as meaningful as that of
Y upon X; this is clearly not true if the values of X have been set by the
researcher.

Where both X and Y are assumed to be built up of simple elements
of equal variability, all of which are present in Y but some of which are
lacking in X, it can be proved mathematically that $r^2$ measures the
proportion of all the elements in Y which are also present in X. For that
reason, in cases where the dependent variable is known to be causally
related to the independent variable, $r^2$ may be called the coefficient of
determination. It may be said to measure the percentage to which the
variance in Y is determined by X, since it measures that proportion of
all the elements of variance in Y which are also present in X.* The
coefficient of determination, $d_{yx}$, may be defined by the equation

$$d_{yx} = r_{yx}^2 \tag{7.9}$$

Where some elements are present in each variable which occur in the other,
the coefficient of determination is the product of these joint proportions.
That is, if 1/2 of the elements in X are the same as 2/3 of the elements in Y,
then the coefficient of determination will be equal to 1/3.

Although the coefficient of correlation was the earliest measure used,
it can be seen that it may be misinterpreted. Thus if half the variance in
Y were directly due to X, the coefficient of correlation would be 0.707
($= \sqrt{1/2}$). Since the coefficient of determination is the most direct and
unequivocal way of stating the proportion of the variance in the dependent
factor which is associated with the independent factor, it should be used
in preference to the correlation coefficient.

* See Note 1, Appendix 3.
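How far the coefficient of correlation overstates the share of variance accounted for can be seen by tabulating a few values. The short sketch below is an illustration added here, not part of the original computations; the function name is an assumption.

```python
import math

def correlation_from_determination(d):
    """Size of r that corresponds to a coefficient of determination d."""
    return math.sqrt(d)

# Proportion of variance accounted for, and the r that accompanies it.
for d in (0.10, 0.25, 0.50, 0.75, 0.90):
    r = correlation_from_determination(d)
    print(f"determination {d:.2f}  ->  correlation {r:.3f}")
```

A correlation of about 0.32 thus goes with only one-tenth of the variance, and one of about 0.71 with one-half.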

Where curvilinear relations have been used in determining the
relationship, the term index of determination will be used to denote the value of
$i^2$, thus retaining the same relation to the index of correlation that the
coefficient of determination bears to r, the coefficient of correlation. The
index of determination, $i_{yx}^2$, may be defined

$$i_{yx}^2 = \frac{s_{y''}^2}{s_y^2} \tag{7.10}$$

When an expression is used such as "Forty per cent of the variance in
yield is due to differences in rainfall," it will be understood that it is
either the coefficient or the index of determination which is being stated.

Relation of the Measures of Correlation to the Two Regression
Lines. Attention has been called in several previous chapters to the fact
that two regression lines can be fitted to any set of observations. These
are denoted by the two coefficients $b_{yx}$ and $b_{xy}$ in the two equations

$$Y = a_{yx} + b_{yx}X$$

and

$$X = a_{xy} + b_{xy}Y$$

Although there are these two regression lines, there is only a single
coefficient of correlation for any one set of observations. In fact, the
coefficient of correlation has certain definite relations to the two lines.
It indicates how closely the two lines approach one another. The higher
the correlation, the closer the two lines come together; the lower the
correlation, the farther they diverge. In perfect correlation (r = ±1)
the two lines coincide. When there is no correlation (r = 0) the two lines
will be at right angles to one another.

This relationship is so exact that the value of the correlation coefficient
can be computed from the slopes of the two lines according to the equation

$$r_{yx}^2 = b_{yx}b_{xy} \tag{7.11}$$

It follows from this equation that when r = 1, $b_{yx} = 1/b_{xy}$, and therefore
the two regression lines will coincide.*
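The relation between the two slopes and the correlation coefficient can be verified on any set of paired values. The sketch below uses a small hypothetical series invented for this purpose; the variable names are assumptions made here. Both slopes share the numerator Σ(xy) and differ only in the sum of squares used as divisor.

```python
import math

# Hypothetical paired observations (not from the text).
x = [1, 2, 3, 4, 5, 6]
y = [3, 2, 6, 5, 9, 12]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n
sum_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sum_xx = sum((xi - mean_x) ** 2 for xi in x)
sum_yy = sum((yi - mean_y) ** 2 for yi in y)

b_yx = sum_xy / sum_xx          # regression of Y on X
b_xy = sum_xy / sum_yy          # regression of X on Y
r = sum_xy / math.sqrt(sum_xx * sum_yy)

print(f"b_yx = {b_yx:.4f}, b_xy = {b_xy:.4f}")
print(f"b_yx * b_xy = {b_yx * b_xy:.4f},  r squared = {r * r:.4f}")
```

The product of the two slopes reproduces the square of r exactly, whatever the data, which is why the slopes can only be reciprocals of each other when the correlation is perfect.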

* This property of the two lines can be used to estimate graphically the closeness of
correlation. When the two variables, X and Y, are stated in terms of standard
deviation units, $X/s_x$ and $Y/s_y$, by dividing each observation by the standard deviation of
the series, the coefficient of correlation will then be a precise mathematical function of
the angle between the two lines. By stating the variables in this way, plotting them on a
dot chart, and drawing in the two lines graphically, a fairly close approximation to
the coefficient can be obtained.




Although there can be only a single coefficient of correlation for a
single set of observations, there can be two indexes of correlation. This
follows from the fact that the curve which expresses the relation of Y
to X,

$$Y = f(X)$$

may be a curve of quite a different type from that which expresses the
relation of X to Y, which we can designate

$$X = \phi(Y)$$

Accordingly the index of correlation $i_{yx}$, which measures the closeness
of correlation according to the first curve, may be quite different from
the index of correlation $i_{xy}$, which measures the closeness according to
the second curve. Only in the special case where all the observations lie
precisely along the curve, so that i = 1, will the two indexes have the same
value. In that case it will also hold true that the curves $Y = f(X)$ and
$X = \phi(Y)$ will be identical with the coordinates reversed.

There is only one correlation coefficient r, however. It measures the
correlation according to both regression lines. Since $r_{yx} = r_{xy}$,
the subscript notations can be used interchangeably.

Adjustments for Number of Observations. Just as the standard
error of estimate from the sample tends to be biased downward, as
compared to the value that is most likely to prevail in the universe, so
the correlation coefficient or index from a small sample is likely to be
biased upward. This will be important, however, only in those special
cases where there is some valid basis for using the sample to make
generalizations concerning the probable closeness of correlation in the universe.
Adjustments to use in such cases, and cautions in their use, are presented
in the second portion of Chapter 17.

The Sampling Significance of the Regression Line or Curve and of
the Measures of Correlation. Chapter 2 showed how a series of samples
drawn from the same universe would yield varying estimates of the true
average in that universe. It also presented methods of estimating how far
the average from a single sample might miss the true average in the
universe. In exactly the same way, if regression lines or curves are
determined for a series of samples from the same universe, they will vary
among themselves. Similarly, the coefficients or indexes of correlation and
the standard errors of estimate will vary from sample to sample. Standard
errors of some of these measures are available. These measures of
reliability are more complicated, both in computation and in interpretation,
than the standard error of an average. Accordingly, their presentation is
deferred to Chapter 17. In addition, the special problem of the reliability of
an individual estimate for an individual new observation, from the results
shown by a sample, is treated in Chapter 19. The methods given in the
present chapter and in Chapter 8 are sufficient for determining the correlation
and regression as shown in the individual sample. Before a student or
research worker uses the results of the sample to draw more general
conclusions as to the relations which are likely to hold true in other
samples or in the universe as a whole, or before he makes estimates for
new observations, he should master these later chapters and should apply
the checks and limitations set forth there in stating his general conclusions
or in making his estimates.

Summary 

This chapter has pointed out that the closeness of relation between 
two variables may be measured either by the absolute closeness with 
which values of one may be estimated from known values of the other 
or on the basis of the proportion of the variance in one which can be 
explained by, or estimated from, the accompanying values of the other. 
The accuracy of estimate is measured by the standard error of estimate,
which indicates the reliability of values of the dependent variable estimated
from observed values of the independent variable.

The relative closeness of the relation is best measured by the coefficient 
of determination, in the case of linear relationship, or by the index of 
determination, in the case of curvilinear relationship. These measures 
show the proportion of the variance in the dependent variable which is 
associated with differences in the other variable. In the case of variables
causally related, they measure the proportion of the variance in one which
can be said to be "caused by" variations in the other.



CHAPTER 8 


Practical methods for working
out two-variable correlation
and regression problems


Terms to Be Used. The preceding discussion has developed the means
by which values of one variable may be estimated from the values of
another, according to the functional relation shown in a set of paired
observations. Simple correlation involves only the means for making
such estimates, and for measuring how closely those estimates conform
to, and account for, the original variation in the variable which is being
estimated, for the given set of observations.

The regression line is used, in statistical terminology, to designate the
straight line used to estimate one variable from another by means of the
equation

Y = a + bX

This equation is termed the linear regression equation, and the coefficient
b, which shows how many units (or fractional parts) Y changes for each
unit change in X, is termed the coefficient of regression.

Where a curvilinear function has been determined, either by the use
of an equation or by graphic methods, the corresponding curve is similarly
designated as the regression curve. Either the mathematical equation or,
if none has been computed, the expression

Y = f(X)

where the symbol f(X) stands for the relation shown by the graphic curve,
is termed the regression equation.

The coefficient of correlation and the index of correlation have both
been defined as the ratio of the standard deviation of the estimated values
of Y to the standard deviation of the actual values, whereas the standard
error of estimate has been defined as the standard deviation of the residuals
from the estimates so made. In linear relations, however, the coefficient
of correlation and the standard error of estimate can both be computed
directly from the same values that were employed in computing the
constants of the regression equation, and from the standard deviation of the
dependent variable. This will be illustrated by the practical example
which follows.

Working Out a Linear Regression. As was illustrated in Chapter 5,
the values for a and b of the regression equation can be determined for
any two variables, X and Y, between which it may be desired to determine
the relation, by working out the values $M_x$, $M_y$, $\Sigma X^2$, and $\Sigma(XY)$, and
then substituting them in the appropriate equations. To calculate the
coefficient of correlation, $r_{yx}$, and the standard error of estimate, $\bar{S}_{y.x}$,
it is necessary to compute in addition only the value $\Sigma Y^2$ and substitute
it in appropriate formulas. The data given in Table 8.1 illustrate the
necessary operations.

Table 8.1

Computing the Values Needed to Determine Linear Regression and
Correlation Coefficients

X = irrigation water applied per acre,* in feet
Y = yield of Pima cotton per acre,* in units of 10 pounds

             X         Y        X²        XY         Y²
           1.8        26      3.24      46.8        676
           1.9        37      3.61      70.3      1,369
           2.5        45      6.25     112.5      2,025
           1.4        16      1.96      22.4        256
           1.3         9      1.69      11.7         81
           2.1        44      4.41      92.4      1,936
           2.3        38      5.29      87.4      1,444
           1.5        28      2.25      42.0        784
           1.5        23      2.25      34.5        529
           1.2        18      1.44      21.6        324
           1.3        22      1.69      28.6        484
           1.8        18      3.24      32.4        324
           3.5        40     12.25     140.0      1,600
           3.5        65     12.25     227.5      4,225

Total     27.6       429     61.82     970.1     16,057
Mean      1.97     30.64

* From James C. Muir and G. E. P. Smith, The use and duty of water in the Salt
River Valley, Agricultural Experiment Station Bulletin 120, 1927. All the plots were on
the same type of soil, Maricopa sandy loam.

The computations shown in this table (squaring both X and Y,
calculating the product XY, summing X, Y, and the three columns
of extensions, and dividing the first two sums by the number of cases
to give the means of X and Y) provide all the basic data necessary.* The
values a and b for the regression equation may next be computed by
substituting these extensions in equations (5.2) and (5.3):

$$b_{yx} = \frac{\Sigma(XY) - nM_xM_y}{\Sigma(X^2) - n(M_x)^2} = \frac{970.1 - 14(1.97)(30.64)}{61.82 - 14(1.97^2)} = \frac{125.0488}{7.4874} = 16.701$$

$$a_{yx} = M_y - b_{yx}M_x = 30.64 - 16.701(1.97) = -2.26$$

The regression line, Y = a + bX, therefore is for this case

$$Y = -2.26 + 16.70X$$

The unadjusted coefficient of correlation, $r_{yx}$, may now be computed from
the following new formula:

$$r_{yx} = \frac{\Sigma(XY) - nM_xM_y}{\sqrt{[\Sigma(X^2) - nM_x^2][\Sigma(Y^2) - nM_y^2]}} \tag{8.1}$$

$$r_{yx} = \frac{970.1 - 14(1.97)(30.64)}{\sqrt{[61.82 - 14(1.97)^2][16{,}057 - 14(30.64)^2]}} = 0.847$$
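As a check on the substitution, the formula can be evaluated directly from the columns of Table 8.1. The sketch below is an illustration added here; the function name is an assumption, and the means are carried at full precision rather than rounded.

```python
import math

# Data of Table 8.1: acre-feet of water (X), cotton yield in 10-pound units (Y).
water = [1.8, 1.9, 2.5, 1.4, 1.3, 2.1, 2.3, 1.5, 1.5, 1.2, 1.3, 1.8, 3.5, 3.5]
cotton = [26, 37, 45, 16, 9, 44, 38, 28, 23, 18, 22, 18, 40, 65]

def correlation_from_extensions(x, y):
    """Coefficient of correlation from the sums of X, Y, X^2, XY, and Y^2."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi * xi for xi in x)
    sum_y2 = sum(yi * yi for yi in y)
    numerator = sum_xy - n * mean_x * mean_y
    denominator = math.sqrt((sum_x2 - n * mean_x ** 2) * (sum_y2 - n * mean_y ** 2))
    return numerator / denominator

r_yx = correlation_from_extensions(water, cotton)
print(f"r = {r_yx:.3f}")
```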


It should be noticed that the numerator of this fraction is the same as
that in the equation for b and that half of the denominator is the same,
except that it is under the radical sign.

Comparison of equations (5.2) and (8.1) with equation (1.5) for the
standard deviation,

$$s_x = \sqrt{\frac{\Sigma(X^2) - nM_x^2}{n}}$$

shows that they may be written more simply:

$$b_{yx} = \frac{\Sigma(XY) - nM_xM_y}{ns_x^2} = \frac{\Sigma(xy)}{ns_x^2} \tag{8.2}$$

$$r_{yx} = \frac{\Sigma(XY) - nM_xM_y}{ns_xs_y} = \frac{\Sigma(xy)}{ns_xs_y} \tag{8.3}$$


* Where the number of cases to be handled is large, various short cuts may be used to
reduce the volume of computation required in computing the sums of extensions $\Sigma X^2$,
$\Sigma XY$, and $\Sigma Y^2$. See pages 455-460 of the 2d edition.

The second form, in each case, uses the notation $\Sigma(xy)$ for
$\Sigma[(X - M_x)(Y - M_y)]$, as discussed in Chapter 5.* The forms shown in equations (5.2),
(5.3), and (8.1), however, are the ones ordinarily used in actual computation
and should be kept clearly in mind.


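In Chapter 7 it was shown that, for a least-squares line, the standard
deviations of the actual values, the estimated values, and the residuals
satisfied the relation

$$s_y^2 = s_{y'}^2 + s_z^2$$

and that

$$r_{yx}^2 = \frac{s_{y'}^2}{s_y^2}$$

It follows that

$$s_z^2 = s_y^2(1 - r_{yx}^2)$$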


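The adjusted standard error of estimate squared (equation 7.3) is equal to

$$\bar{S}_{y.x}^2 = s_z^2\,\frac{n}{n-2}$$

The adjusted standard error can be calculated conveniently by means of
the following equation:

$$\bar{S}_{y.x} = \sqrt{\frac{\Sigma(Y^2) - nM_y^2}{n-2}\,(1 - r_{yx}^2)} \tag{8.4}$$

$$\bar{S}_{y.x} = \sqrt{\frac{16{,}057 - 14(30.64^2)}{14-2}\,[1 - (0.847)^2]} = \sqrt{68.59} = 8.28$$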

As noted earlier, though $r_{yx} = r_{xy}$, $b_{xy}$ is not the same as $b_{yx}$. The
former regression, showing the change in X for each unit change in Y
(that is, regarding the dependent factor as the independent factor instead),
is obtained by modifying equation (5.2) to the following form:†

$$b_{xy} = \frac{\Sigma(XY) - nM_xM_y}{\Sigma(Y^2) - n(M_y)^2}$$

The new regression coefficient, $b_{xy}$, shows the average change in water
applied with each additional unit (10 pounds) of cotton harvested. With
the quantity of water subject to human control, as in this case, this relation
would have little meaning. However, if it is desired to chart it on Figure
8.1 along with the other regression line, it can be charted according to
the linear regression equation

$$X = a_{xy} + b_{xy}Y$$

The value of the new a can be computed by restating equation (5.3)
in the form

$$a_{xy} = M_x - b_{xy}M_y$$

* The value of $\Sigma(xy)$ is sometimes called the product moment, and $\Sigma(xy)/n$ is called the
covariance.

† When the correlation is perfect, so that $r_{xy} = 1$, the two regression coefficients will
have the definite relation $b_{yx} = 1/b_{xy}$. Under these conditions the regression lines will be
identical, no matter which variable is regarded as the independent variable and which
as the dependent.

Equation (8.4) completes the computation of all the values needed,* except
the coefficient of determination $d_{yx}$, which is simply $r_{yx}^2$. That is,

$$d_{yx} = r_{yx}^2 = (0.847)^2 = 0.717$$
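The second regression line and the coefficient of determination can be obtained from the same sums. The sketch below is an illustration added here, with the means at full precision and variable names that are assumptions; it also confirms on the cotton data that the product of the two regression coefficients equals the coefficient of determination.

```python
# Data of Table 8.1: acre-feet of water (X), cotton yield in 10-pound units (Y).
water = [1.8, 1.9, 2.5, 1.4, 1.3, 2.1, 2.3, 1.5, 1.5, 1.2, 1.3, 1.8, 3.5, 3.5]
cotton = [26, 37, 45, 16, 9, 44, 38, 28, 23, 18, 22, 18, 40, 65]

n = len(water)
mean_x, mean_y = sum(water) / n, sum(cotton) / n
sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(water, cotton))
sum_xx = sum((x - mean_x) ** 2 for x in water)
sum_yy = sum((y - mean_y) ** 2 for y in cotton)

b_yx = sum_xy / sum_xx                 # yield change per acre-foot of water
b_xy = sum_xy / sum_yy                 # water change per 10-pound unit of yield
a_xy = mean_x - b_xy * mean_y          # constant of the line X = a_xy + b_xy * Y
d_yx = sum_xy ** 2 / (sum_xx * sum_yy) # coefficient of determination, r squared

print(f"b_xy = {b_xy:.4f} acre-feet per 10-pound unit, a_xy = {a_xy:.3f}")
print(f"d_yx = {d_yx:.3f},  b_yx * b_xy = {b_yx * b_xy:.3f}")
```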

Interpreting the Measures of Linear Regression. The next step is
to examine the values of the several statistics which have been computed
from the sample and see what they mean.

The coefficient of regression of Y on X, $b_{yx} = 16.70$, shows that on
the average the acre yield of cotton increases 16.7 10-pound units, or
167 pounds, for each additional acre-foot of water applied. The constant
a shows that with no water applied, a yield of -2.26 10-pound units,
-22.6 pounds, or less than no cotton at all, might be expected. Since
these results are based on observations extending from 1.2 acre-feet of
water to 3.5, the relations shown by the regression line do not necessarily
hold beyond those limits, and it is not certain what the yield would be
when no water is applied. Extrapolating the regression line to that point
has no meaning, by itself.

The regression equation

Y = -2.26 + 16.70(X)

or

Yield = -22.6 + 167 (feet of water)

then gives the yields of cotton estimated as most likely to be obtained from
the quantity of water applied within the limits of 1.2 to 3.5 feet. Figure
8.1 shows how these estimated values, along the regression line, compare
with the actual yields observed.

The standard error of estimate, 8.28 10-pound units or 82.8 pounds,
shows that the (adjusted) standard deviation of the differences between
the actual and the estimated values is 82.8 pounds of cotton. Two lines
have been drawn in Figure 8.1, at 82.8 pounds above and below the
regression line. It will be seen that of the 14 cases, 9 fell between these
two lines, or in the zone within one standard error on either side of the
regression line.
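The count of cases falling inside the zone can be reproduced without the chart. The following sketch is an illustration added here; it uses the rounded constants of the regression equation and the rounded standard error exactly as printed above, and its variable names are assumptions.

```python
# Data of Table 8.1: acre-feet of water (X), cotton yield in 10-pound units (Y).
water = [1.8, 1.9, 2.5, 1.4, 1.3, 2.1, 2.3, 1.5, 1.5, 1.2, 1.3, 1.8, 3.5, 3.5]
cotton = [26, 37, 45, 16, 9, 44, 38, 28, 23, 18, 22, 18, 40, 65]

a, b = -2.26, 16.70          # constants of the fitted line, as rounded in the text
standard_error = 8.28        # adjusted standard error of estimate, 10-pound units

# Difference between each actual yield and the yield estimated from the line.
residuals = [y - (a + b * x) for x, y in zip(water, cotton)]
inside = sum(1 for z in residuals if abs(z) <= standard_error)

print(f"{inside} of {len(residuals)} cases lie within one standard error of the line")
```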

* Except also the calculation of measures of reliability, as explained in Chapter 17.



The coefficient of correlation, $r_{yx} = 0.85$, and the coefficient of
determination, $d_{yx} = 0.72$, show that about 72 per cent of the variance in the
yield of this crop in this area, on the farms from which these records
were obtained, could be accounted for by the differences in the quantity
of water used in irrigation. Since this leaves only 28 per cent of the variance
to be accounted for by all other factors, it would appear that the quantity
of water applied (or other factors associated with it) was the most important
factor which was associated with the yield of cotton on these farms and
on this type of soil.*

Fig. 8.1. Relation of yield of cotton to irrigation water applied; estimated yields from
a linear regression and zone of probable yields indicated by the adjusted standard error
of estimate. (Horizontal axis: water applied (acre-feet), X.)

The fact that 72 per cent of the variance in yield can be explained by
corresponding differences in the quantity of water applied does not in
itself mean that the differences in irrigation caused the differences in
yield. For example, it might be possible that the quantity of water applied
was regulated to conform to the fertility of the land and that the differences
in yield were really due in part to the differences in fertility. The statistical
measure merely tells how closely the variance in one variable was associated
with the variance in the other; whether that association is due to, or can
be taken as evidence of, cause-and-effect relation is another matter, and
is outside the scope of the statistical analysis. (For more extended
discussion of this point, see Chapters 25 and 26.)

* Methods of adjusting these percentages for estimated bias are presented in Chapter 17.

Working Out a Curvilinear Regression. The next step is to consider
whether the straight line is adequate to describe the way that the yield
increases as more water is applied, or whether a curve had better be
employed. (This step can be taken before any of the linear results are worked
out, and, if a curve is decided on, the previous work can be skipped
entirely, if desired.)

Where the equation of the curve has been determined by mathematical
means, the standard error of estimate and the index of correlation may
be computed without working out the estimates and residuals for each
of the individual cases. These methods will be described in Chapter 12.

Before fitting the curve, we must consider what type of curve it is
logical to expect. In most agricultural production problems, diminishing
returns are experienced.* That is, the application of successive increments
of fertilizer or other productive aid on the same areas will be expected
to produce a smaller and smaller increase in the product. Also, it is
known that if too much of some factors is applied, the result may be to
produce a decline in output. The decline after the point of maximum
output is reached may be gradual, or it may be sudden, owing to a toxic
effect of too much of one substance upon the plant or animal. These
considerations would lead us to expect a curve with the following
characteristics:

1. It should rise steeply at first, and then less and less sharply until a
maximum is reached.

2. It might show a decline after the maximum is reached, either gradual
or sharp.

3. It would have only the single point of maximum yield.

These are the conditions we shall apply in fitting the curve.

A mathematical curve can be derived expressing these logical conditions.
The curve which has been found most suitable to express them, however,
has a quite complicated mathematical form, and cannot be fitted to the
data by least-squares or other simple arithmetic methods. We shall
therefore use a freehand fit, keeping in mind the conditions stated.

Examining Figure 8.1 more closely, we see that, in the range up to
1.8 acre-feet of water, the actual yields lie below the regression line 4
times, and above 4 times; in the range from 1.9 to 3 acre-feet, the actual
yields lie above in all 4 observations; and above 3 acre-feet the 1 yield
below the line is much farther below than is the 1 above. These facts
suggest that a curve convex from above, giving lower estimated yields
than the straight line for the lowest and highest applications of water and
higher estimated yields for the intermediate applications, would more
accurately represent the relations in this case. The facts also agree with
the logical conditions stated. (The number of observations is too small
to serve as an accurate indication of the shape of the curve, but it will
serve at least as a simple illustration of the way the whole problem may be
worked through.)

The next step is to group the observations according to the value of X
(the quantity of water) and average both X and Y, water and yield
(Table 8.2). In view of this small number of observations, rather large
groups are taken; were more cases available, the groups might be made
narrower.

* William J. Spillman, The Law of Diminishing Returns, World Book Co.,
Yonkers-on-Hudson, New York, and Chicago, 1924.


Table 8.2

Computation of Group Averages to Indicate Regression Curve:
Cotton Example

          X (water)       X (water)       X (water)       X (water)
           1-1.4           1.5-1.9         2.0-2.9         3.0-3.9
          X       Y       X       Y       X       Y       X       Y
         1.4     16      1.8     26      2.5     45      3.5     40
         1.3      9      1.9     37      2.1     44      3.5     65
         1.2     18      1.5     28      2.3     38
         1.3     22      1.5     23
                         1.8     18

Sums     5.2     65      8.5    132      6.9    127      7.0    105
Means    1.3  16.25      1.7   26.4      2.3  42.33      3.5   52.5
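The grouping and averaging of Table 8.2 can be sketched as follows. This is an illustration added here; the function name is an assumption, and the class limits are written as half-open intervals (lower limit included, upper excluded), which gives the same grouping as the table for data recorded to one decimal.

```python
# Data of Table 8.1: acre-feet of water (X), cotton yield in 10-pound units (Y).
water = [1.8, 1.9, 2.5, 1.4, 1.3, 2.1, 2.3, 1.5, 1.5, 1.2, 1.3, 1.8, 3.5, 3.5]
cotton = [26, 37, 45, 16, 9, 44, 38, 28, 23, 18, 22, 18, 40, 65]

# Class limits for X corresponding to 1-1.4, 1.5-1.9, 2.0-2.9, and 3.0-3.9.
class_limits = [(1.0, 1.5), (1.5, 2.0), (2.0, 3.0), (3.0, 4.0)]

def group_averages(x, y, limits):
    """Return (number of cases, mean X, mean Y) for each class of X."""
    rows = []
    for low, high in limits:
        members = [(xi, yi) for xi, yi in zip(x, y) if low <= xi < high]
        count = len(members)
        mean_x = sum(xi for xi, _ in members) / count
        mean_y = sum(yi for _, yi in members) / count
        rows.append((count, mean_x, mean_y))
    return rows

averages = group_averages(water, cotton, class_limits)
for (low, high), (count, mean_x, mean_y) in zip(class_limits, averages):
    print(f"{low:.1f} to under {high:.1f}: n = {count}, "
          f"mean X = {mean_x:.2f}, mean Y = {mean_y:.2f}")
```

These group means are the points through which the freehand curve is then drawn.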


These averages are then plotted, as shown in Figure 8.2, an irregular
line is dotted in connecting them, and as smooth a curve as possible which
fulfills the stated conditions is drawn in freehand through the averages and
the broken line, just as discussed in Chapter 6. This then gives the
regression curve. It is seen to fit the data well, and yet to fulfill the logical
conditions stated. The point of maximum yield, however, apparently
lies beyond the limit of the observations.

Next the estimated yields for each different application of water are
read off from this curve, and the differences between the actual and the
estimated yields are determined. These residuals are then squared to
determine their standard deviation; the Y values are also squared so as to
determine their standard deviation, and so give the basis for measuring
the amount of correlation. The sum of the Y'' values is slightly smaller
than the sum of the Y values, and the mean of the z'' values is therefore
not exactly zero, but 0.264. That indicates that the curve shown in
Figure 8.2 should be shifted up 0.264 unit, or 2.64 pounds, to make the
estimated and actual averages agree.* Representing this curve by f(X),
the regression equation may therefore be written

Y = k + f(X)

Y = 0.264 + f(X)

Fig. 8.2. Relation of yield of cotton to irrigation water applied; estimated yields from
a curvilinear regression, and zone of probable yields as indicated by the adjusted
standard error of estimate.

* In problems with many observations, the sum of the Y values and of the Y'' values
may be determined separately for the several different portions of the curve to see if its
position should be shifted in one portion and not in another. This process can be
carried too far, however, for if the divisions are made too small the effect will be to
make the curve pass through each successive group average, without smoothing out the
irregularities into a continuous function.



The values at the foot of Table 8.3 now give the sums necessary to
measure the closeness of the correlation. First the standard deviations
of Y and z'' are computed, using the formula

$$s_y = \sqrt{\frac{\Sigma(Y^2) - n(M_y)^2}{n}} = 14.44$$

$$s_{z''} = \sqrt{\frac{\Sigma(z'')^2 - n(M_{z''})^2}{n}} = \sqrt{\frac{718.51 - 14(0.264^2)}{14}} = 7.16$$


Table 8.3

Computation of Residuals and Standard Deviation for Curvilinear
Regression: Cotton Example

X   = water per acre (acre-feet)
Y   = yield (10-pound units)
Y'' = yield estimated from X (10-pound units)

          X        Y        Y''     Y - Y'' = z''     (z'')²        Y²
         1.8      26       29.0         -3.0            9.00        676
         1.9      37       31.0          6.0           36.00      1,369
         2.5      45       42.8          2.2            4.84      2,025
         1.4      16       19.2         -3.2           10.24        256
         1.3       9       16.8         -7.8           60.84         81
         2.1      44       35.2          8.8           77.44      1,936
         2.3      38       39.5         -1.5            2.25      1,444
         1.5      28       21.9          6.1           37.21        784
         1.5      23       21.9          1.1            1.21        529
         1.2      18       14.2          3.8           14.44        324
         1.3      22       16.8          5.2           27.04        484
         1.8      18       29.0        -11.0          121.00        324
         3.5      40       54.0        -14.0          196.00      1,600
         3.5      65       54.0         11.0          121.00      4,225

Sums             429      425.3         +3.7          718.51     16,057

Then, by equation (7.6),

$$\bar{S}_{y.f(x)}^2 = s_{z''}^2\left(\frac{n}{n-m}\right) = (7.16^2)\left(\frac{14}{14-3}\right) = 65.23$$

Here 3 is used for the value of m, since it is judged that a parabolic
equation of type (a), Chapter 6, with 3 constants, would be adequate to
reproduce the freehand curve.
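The computations based on Table 8.3 can be retraced in a short sketch. It is an illustration added here: the estimated yields are simply the readings from the freehand curve as listed in the table, the value m = 3 is the judgment stated above, and the variable names are assumptions.

```python
import math

# Actual yields and the yields read from the freehand curve (Table 8.3),
# both in 10-pound units and in the same order of cases.
cotton = [26, 37, 45, 16, 9, 44, 38, 28, 23, 18, 22, 18, 40, 65]
curve_reading = [29.0, 31.0, 42.8, 19.2, 16.8, 35.2, 39.5,
                 21.9, 21.9, 14.2, 16.8, 29.0, 54.0, 54.0]

n = len(cotton)
m = 3   # constants judged necessary to reproduce the curve

residuals = [y - est for y, est in zip(cotton, curve_reading)]
mean_z = sum(residuals) / n                     # amount by which to shift the curve
var_z = (sum(z * z for z in residuals) - n * mean_z ** 2) / n
s_z = math.sqrt(var_z)                          # standard deviation of residuals

adjusted_variance = var_z * n / (n - m)         # equation (7.6)
adjusted_se = math.sqrt(adjusted_variance)

print(f"mean residual = {mean_z:.3f}, s_z'' = {s_z:.2f}")
print(f"adjusted variance = {adjusted_variance:.2f}, "
      f"adjusted standard error = {adjusted_se:.2f}")
```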

The standard error of estimate for the graphic regression curve is 
thus 8 07 10 pound units, or 80 7 pounds This is 2 1 pounds smaller 
than the corresponding value in the case of the linear correlation, indicating 
how much more closely the curve fits the data than does the straight line, 
even after allowing for its greater flexibility In Figure 8 2 two dotted 
lines have been drawn in, each 807 pounds away from the regression 
curve, indicating the zone of estimate within which approximately two* 
thirds of the cases fall (10 out of 14 in this instance) and within which 
two-thirds of the actual yields may be expected to fall if new estimates 
of yield are made from the water apphed for additional cases drawn 
from the same universe (Note also the discussion, in Chapters 17 and 
19, of more exact methods of calculating confidence intervals for individual 
estimates ) 
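The arithmetic behind the standard error of estimate and the zone of
estimate can be checked with a short script. What follows is only a
minimal sketch in Python, assuming the Y and Y' columns as transcribed
from Table 8.3 and m = 3 constants; the variable names are illustrative
and are not part of the original text.

```python
import math

# Yield Y and estimated yield Y' from Table 8.3 (10-pound units).
y = [26, 37, 45, 16, 9, 44, 38, 28, 23, 18, 22, 18, 40, 65]
y_est = [29.0, 31.0, 42.8, 19.2, 16.8, 35.2, 39.5,
         21.9, 21.9, 14.2, 16.8, 29.0, 54.0, 54.0]

n = len(y)
m = 3  # number of constants assumed for the parabolic curve

# Residuals z'' = Y - Y' and their standard deviation about their own mean.
z = [actual - estimate for actual, estimate in zip(y, y_est)]
mean_z = sum(z) / n
s_z = math.sqrt(sum(v * v for v in z) / n - mean_z ** 2)

# Allow for the number of constants, following the form of equation (7.6).
se_est = math.sqrt(s_z ** 2 * n / (n - m))

# Count the cases lying within one standard error of the curve.
inside = sum(1 for v in z if abs(v) <= se_est)

print(f"s_z'' = {s_z:.2f}  standard error = {se_est:.2f}  "
      f"inside zone = {inside} of {n}")
```

The script reproduces the count of 10 cases out of 14 inside the zone.
The standard error it prints may differ from the text's 8.07 in the last
digit, because the text carries rounded intermediate values.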

The index of correlation may be computed from the formula

$$ i_{yx} = \sqrt{1 - \frac{s_{z''}^{2}}{s_{y}^{2}}} $$

which is equivalent to Equation (7.8). In this case it works out to be

$$ i_{yx} = \sqrt{1 - \frac{7.16^2}{14.44^2}} = 0.868 $$

Since the index of determination is simply i², it is 75.4 per cent. The
index of determination, 75.4 per cent, compares with the coefficient of
determination of 71.7 per cent. Apparently taking the curvilinear nature
of the relations into account has increased the proportion of the variance
in yield accounted for by differences in water application by 4 per cent
of the total variance.* (Only the measures of determination can be directly
compared in this way. If the coefficient of correlation, 0.847, were
subtracted from the index of correlation, 0.868, that would give an incorrect
idea of the importance of taking account of the curvilinear nature of
the relation.)
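The comparison of the two measures of determination can be verified
numerically. The sketch below is an illustration only: it assumes the
Y and Y' columns as transcribed from Table 8.3, uses the unadjusted
form i² = 1 - s²(z'')/s²(y), and takes the linear coefficient of
determination, 0.717, directly from the text rather than recomputing it.

```python
import math

# Yield Y and estimated yield Y' from Table 8.3 (10-pound units).
y = [26, 37, 45, 16, 9, 44, 38, 28, 23, 18, 22, 18, 40, 65]
y_est = [29.0, 31.0, 42.8, 19.2, 16.8, 35.2, 39.5,
         21.9, 21.9, 14.2, 16.8, 29.0, 54.0, 54.0]


def variance(values):
    """Variance about the mean, dividing by n (no small-sample adjustment)."""
    k = len(values)
    mean = sum(values) / k
    return sum(v * v for v in values) / k - mean ** 2


residuals = [actual - estimate for actual, estimate in zip(y, y_est)]

# Index of determination and index of correlation for the freehand curve.
i_squared = 1 - variance(residuals) / variance(y)
i_index = math.sqrt(i_squared)

# Coefficient of determination for the straight line, quoted from the text.
r_squared_linear = 0.717
gain = i_squared - r_squared_linear

print(f"i^2 = {i_squared:.3f}  i = {i_index:.3f}  gain over line = {gain:.3f}")
```

The gain is measured on the determination scale, which is the only scale
on which the text says the two measures may be subtracted.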

* See Chapter 23 for tests as to whether this difference is large enough to be significant,
and see Chapter 17 for corrections for number of constants involved. By using the curve
without these adjustments, the difference between r² and i² exaggerates the real gain,
especially in small samples.

Interpreting the Measures of Curvilinear Regression. The index of
determination and the accompanying standard error of estimate have
been interpreted for the curve in much the same manner as were the
coefficient of determination and the standard error of estimate for the
straight line. In the case of the regression curve itself, however, a somewhat
different method of presentation may be best, since a mathematical
equation expressing the relation has not been computed.

The regression curve just worked out for the cotton problem, for
example, may be presented either as a curve showing graphically the yield
to be expected for various applications of water, as is illustrated in Figure
8.2, or as a table showing the same thing, as in Table 8.4. In both instances
the constant which has been determined from the average of z'' is added
to the values read from the curve in Figure 8.2, f(X), so as to give the
final estimates which would be made by taking into account this slight
shift in the position of the curve.


Table 8.4

Yield of Pima Cotton, with Different Applications of Irrigation
Water, on Maricopa Sandy Loam Soils in the Salt River Valley,
Arizona, in 1913, 1914, and 1915

Irrigation Water Applied    Average Yield of Cotton Lint
(acre-feet)                 (pounds per acre)

1.25                        156
1.50                        222
1.75                        283
2.00                        335
2.25                        385
2.50                        431


In preparing the table, the relation is shown only for that range of water
application within which the bulk of the observations falls. Similarly,
only this range should be shown by the solid line in the chart; a dotted
line might be used to indicate the relations beyond that up to the extremes
observed. Neither the regression line nor curve should, ordinarily, be
carried beyond the limits of the observations on which it was based.
Also, before general conclusions are drawn as to the application of the
results to cases other than those included in the sample (as, in this instance,
to other fields in the same area), the standard errors set forth in Chapters 17
and 19 should be calculated and confidence intervals based upon them
should be included in the interpretation, for curves mathematically fitted.





Summary 

This chapter has illustrated the way in which regression analysis
may be applied to a specific problem, the manner in which linear and
curvilinear regressions may be determined most simply, and the way
in which they may be interpreted. In addition, the simplest manner of
computing the standard error of estimate and the coefficient and the index
of correlation has been illustrated, and their significance has been
briefly discussed.



CHAPTER 9 


Three measures of correlation 
and regression — the meaning 
and use for each 


So many different statistical coefficients have been introduced in the 
discussion of correlation that there may be some confusion as to the 
meaning and use of the different statistics. Particularly in the linear 
situation, there are three statistics which summarize nearly all that a 
regression analysis reveals. 

First, the standard error of estimate indicates how nearly the estimated 
values agree with the values actually observed for the variable being 
estimated. This coefficient is stated in the same units as the original 
dependent variable, and its size can be compared directly with those 
values. 

Second, the coefficient of determination (r²) shows what proportion
of the variance in the values of the dependent variable can be explained
by, or estimated from, the concomitant variation in the values of the
independent variable.¹ Since this coefficient is a ratio, it is a “pure
number”; that is, it is an arbitrary mathematical measure, whose values
fall within a certain limited range, and it can be compared only with
other statistics like itself, derived from similar problems.

Third, the coefficient of regression measures the slope of the regression
line; that is, it shows the average number of units increase or decrease
in the dependent variable which occur with each increase of a specified
unit in the independent variable. Its exact size thus depends not only on
the relation between the variables but also on the units in which each
is stated. It can be reduced to another form, however, by stating each of
the variables in units of its own individual standard deviation. In this

¹ These statements are all subject to the error limitations set forth later, in Chapters
17 and 19.



form it has been termed β, or the beta coefficient.² The relation between
beta and the coefficient of regression may be indicated by stating the
regression equation in both ways:

$$ Y = a + b_{yx} X $$

$$ \frac{Y}{s_y} = a' + \beta_{yx} \frac{X}{s_x}, \qquad \beta_{yx} = b_{yx} \frac{s_x}{s_y} $$

Stated in this way, β for the cotton yield problem is 0.845. That is,
for each increase of one standard deviation (0.73 acre-foot of water) in
X the yield of cotton increased 0.845 of one standard deviation. Since
the standard deviation of Y was 144.3 pounds, that is equal to 121.9
pounds of cotton for each 0.73 acre-foot of water. This is at the rate of
167 pounds of cotton for each foot of water, which is the same thing as
was shown by the coefficient of regression. However, for comparisons
between problems where the standard deviations are much different,
the beta coefficient may have value. It is evident that in simple correlation
the value of beta is the same as that of r.
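The conversion between the regression coefficient and the beta
coefficient is a single rescaling, which the sketch below illustrates.
It assumes the three figures quoted in the text (167 pounds per
acre-foot, 0.73 acre-foot, and 144.3 pounds) and the relation
β = b × (standard deviation of X) / (standard deviation of Y); the
variable names are hypothetical.

```python
# Figures quoted in the text for the cotton yield problem.
b_yx = 167.0   # pounds of cotton per acre-foot of water
s_x = 0.73     # standard deviation of X, acre-feet
s_y = 144.3    # standard deviation of Y, pounds

# Beta: the regression coefficient with both variables stated in units
# of their own standard deviations.
beta = b_yx * s_x / s_y

# Pounds of cotton associated with one standard deviation of water.
pounds_per_sd_of_water = beta * s_y

# Converting back recovers the rate per acre-foot.
pounds_per_acre_foot = pounds_per_sd_of_water / s_x

print(f"beta = {beta:.3f}")
print(f"pounds per {s_x} acre-foot = {pounds_per_sd_of_water:.1f}")
print(f"pounds per acre-foot = {pounds_per_acre_foot:.0f}")
```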

Relation of the Different Coefficients to Each Other. Even though
each of the three coefficients measures certain aspects of the relation
between variables, it does not follow that all three coefficients will vary
together, or that a problem which shows a high coefficient of determination
will also show a high regression coefficient or a low standard
error of estimate. That is because they measure different aspects of the
relation.

If either of the standard deviations involved has been artificially
modified by selection of the sample, as in the “regression model” of Chapter
17, then the beta coefficient will be of less significance, just as will the
correlation coefficient.

² See Truman Kelley, Statistical Method, p. 282, Macmillan, New York, 1924, and
George W. Snedecor, Statistical Methods, 5th ed., p. 416, Iowa State College Press,
Ames, 1956. Snedecor uses the term “standard regression coefficient” for β.

The particular usefulness of each of the three different groups of correlation
measures is illustrated in Figure 9.1, which shows three sets of simple
relationships, with hypothetical data.

Here the regression coefficient is smaller in (a) than (b). In (a) an
additional inch of rain causes an average increase of 2.5 bushels in yield,
as compared with an increase of 3.1 bushels in (b). But in case (a), a
considerable part of the variation in yield is apparently due to rainfall,
as shown by the high correlation (r = 0.83) and the small size of the
standard error of estimate (2.2 bushels); whereas in case (b), factors
other than rainfall apparently cause most of the differences in yield, as
indicated by the lower correlation (r = 0.71) and the larger standard
error of estimate (3.8 bushels). In terms of determination apparently
about 69 per cent of the differences in yield are related to differences in
rainfall in the first case, and only about 50 per cent in the second.

Fig. 9.1. Hypothetical sets of data, illustrating three types of correlation
coefficients. [Panels (a), (b), and (c), each plotting yield against
rainfall; panel labels: (a) r = 0.83, (b) r = 0.71, (c) r = 0.47.]

In comparison with (a) and (b), case (c) has much less variable yields,
ranging only from about 8 bushels to 12 bushels, compared with a range
of 8 to 21 in case (a) and 0 to 20 in case (b). Only a small part (22 per cent)
of the variation in yields is associated with rainfall differences, as indicated
by the low correlation (0.47). An increase of 1 inch in rainfall apparently
causes only 0.5 bushel increase in yield. Yet in spite of this low relation,
it is possible to estimate yields more accurately, given the rainfall, in this
case than in either of the other two, as is shown by the standard error of
estimate of 1.15 bushels as compared to 2.2 bushels for (a) and 3.8 for (b).
The original variation in yields is so slight in case (c) that even the small
relation shown to rainfall is enough to make it possible to estimate yields
more accurately than in either of the other cases.³
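Why a low correlation can go with a small standard error follows from
the way the two are tied together through the standard deviation of Y.
The sketch below is an illustration under stated assumptions: it uses
the unadjusted relation S = s_y × sqrt(1 - r²), which ignores the
small-sample adjustments mentioned elsewhere in the book, the value of
about 1.3 bushels given in the chapter's footnote for the standard
deviation of Y in case (c), and the figures quoted in the text for
case (a). The function names are hypothetical.

```python
import math


def standard_error(s_y, r):
    """Unadjusted standard error of estimate implied by s_y and r."""
    return s_y * math.sqrt(1 - r ** 2)


def implied_standard_deviation(se, r):
    """Standard deviation of Y implied by a standard error and r."""
    return se / math.sqrt(1 - r ** 2)


# Case (c): small spread in yields, low correlation.
se_c = standard_error(1.3, 0.47)

# Case (a): back out the spread in yields from the quoted S and r.
sd_a = implied_standard_deviation(2.2, 0.83)

print(f"case (c): standard error about {se_c:.2f} bushels")
print(f"case (a): implied standard deviation of yield about {sd_a:.1f} bushels")
```

Under these assumptions case (a) starts from a spread in yields roughly
three times that of case (c), so even its high correlation leaves a
larger error of estimate.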

These three cases illustrate the relative place of each of the three types
of correlation measure. Case (b) shows the greatest change in yield for a
given change in rainfall (the regression measure); case (a) shows the
highest proportion of differences in yields accounted for by rainfall
(the correlation or determination measure); and case (c) shows the greatest
accuracy of estimate (the error of estimate measure). Which of these
measures should have most attention in a particular investigation depends
upon the phase of the investigation which is most important: the amount
of change (regression), the proportionate importance (correlation), or
the accuracy of estimate (standard error). All have their place, and none
should be entirely overlooked or ignored.

In this chapter, we have not considered the question of the sampling
reliability of the values of the statistics determined from samples with
various numbers of observations, or of their confidence intervals, as that
is not relevant to the special points examined here. We have also given
little attention to the fact that correlation and determination coefficients
have only limited meaning if the sample is not fully representative of a
universe (note Chapters 17 and 18).

³ The standard deviation of Y is so small in case (c) (only about 1.3) that Y could be
estimated from $M_y$ in this case more closely than from the regression line in either of the
other cases. A similar case could be constructed with $s_y$ = 2.2 or more, and r very low,
which would still have $S_y$ smaller than in (a) or (b).

REFERENCE 

Ezekiel, Mordecai, Meaning and significance of correlation coefficients, Amer. Econ.
Rev., Vol. XIX, No. 2, pp. 246-50, June, 1929.



SECTION III 


Multiple 
Linear Regressions 


CHAPTER 10 

Determining 
multiple linear regressions: 
(1) by successive elimination 


The Problem of Multiple Relations 

The relations studied up to this point have all been of the type where the
differences in one variable were considered as due to, or associated with,
the differences in one other variable. But in many types of problems the
differences in one variable may be due to a number of other variables,
all acting at the same time. Thus the differences in the yield of corn from
year to year are the combined result of differences in rainfall, temperature,
winds, and sunshine, month by month or even week by week through the
growing season. The premiums or discounts at which different lots of
wheat sell on the same day vary with the protein content, the weight per
bushel, the amount of dockage or foreign matter, and the moisture content.
The speed with which a motorist will react to a dangerous situation may
vary with his keenness of sight, his speed of nervous reaction, his
intelligence, his familiarity with such situations, and his freshness or fatigue.
The price at which sugar sells at wholesale may depend upon the
production of that season, the carryover from the previous season, the general
level of prices, and the prosperity of consumers. The weight of a child
will vary with its age, height, and sex. The volume of a given weight of
gas varies with the temperature and the barometric pressure.

The physicist and the biologist use laboratory methods to deal with
problems of compound or multiple relationship. Under laboratory
conditions all the variables except the one whose effect is being studied
can be held constant, and the effect determined of differences in the one
remaining varying factor upon the dependent variable, while effects of
differences in the other variables are thus eliminated. In the case of a
gas, for example, the temperature may be held constant while the volume
at different barometric pressures is determined experimentally, and then
the pressure held constant while the volume at different temperatures is
determined. For many of the problems with which the statistician has to
deal, however, such laboratory controls cannot be used. Rainfall and
temperature and sunshine vary constantly, and only their combined effect
upon crop yields can be noted. The astronomer or meteorologist can
observe events in the heavens or the atmosphere, but cannot control
them (though seeding clouds to make rain is a step in that direction).
Economic conditions are constantly shifting, and only the total result of
all the factors in the existing situation can be measured at any time.
And so on through many other types of multiple relations similar to those
mentioned: the statistician has to deal with facts arising from the complex
world about him, and frequently has but little opportunity to utilize
laboratory checks or artificial controls.
Theoretical Example. Where a dependent variable is influenced not
by a single independent variable, as in the relation of Y to X, but
by two or more independent variables, we can represent the relation
symbolically by the equation

$$ X_1 = a + b_2 X_2 + b_3 X_3 + \cdots + b_n X_n \qquad (10.1) $$

Here X₁ represents the dependent variable, and X₂, X₃, ..., Xₙ represent
the several independent variables.

The meaning of the several constants in this equation and the way in
which it may be interpreted geometrically can be shown by making up a
simple example.

Let us assume that in a new irrigation project the farms are all alike
in quality of land and kinds of buildings and that the price at which each
one is sold to the settlers is computed as follows:

Buildings, $1,000 per farm
Irrigated land, $100 per acre
Range (non-irrigated) land, $20 per acre

Using X₁ to represent the selling price per farm in dollars, X₂ to represent
the number of acres of irrigated land in each farm, and X₃ to represent the
number of acres of range land, we can state the method of computing the
selling price in the single equation

$$ X_1 = 1{,}000 + 100 X_2 + 20 X_3 $$



The relations stated in this equation may be represented graphically
as shown in Figure 10.1. The representation is broken up into halves.
The upper half shows the relation of farm value to irrigated land for farms
that have no range land; the lower half shows the relation of farm value
to range land for farms that have no irrigated land. This figure is
constructed exactly the same as was Figure 5.2. Thus in the upper section of
Figure 10.1, each increase of 1 unit in X₂, as, for example, from 3 to 4,
adds $100 to the farm value. Similarly, in the lower section, each change
of 1 unit in X₃, as, for example, from 5 to 6, adds $20 to the farm value.
In each case, as for zero acres, the line begins with the value of a, 1,000,
to cover the value of the buildings.

Fig. 10.1. Graph of the function X₁ = 1,000 + 100X₂ + 20X₃.

Equation (10.1) is called the multiple regression equation. The term
multiple is added to indicate that it explains X₁ in terms of two or more
independent variables, X₂, X₃, ..., Xₙ. The coefficients b₂ and b₃ are
termed net regression coefficients. The term net is added to indicate that
they show the relation of X₁ to X₂ and X₃, respectively, excluding, or
net of, the associated influences of the other independent variable or
variables. In contradistinction, the regression coefficient of equation
(5.1),

$$ Y = a + bX $$

may be termed the gross regression coefficient. The term gross is added
here to indicate that it shows the apparent, or gross, relation between
Y and X without considering whether that relation is due to X alone,
or partly or wholly to other independent variables associated with X.

The difference between the net and gross regression coefficients may
be further shown by a simple arithmetic illustration, based on the
farm-value formula just discussed.

Let us take a dozen assumed farms and calculate from the pricing
equation what their selling prices should be. In setting up these illustrative
farms, let us assume further that in general the farms with large irrigated
areas had small range areas and those with little irrigated land had larger
amounts of range land. Under these conditions the computation works
out as shown in Table 10.1.


Table 10.1

Computation of Estimated Selling Price
X₁ = 1,000 + 100X₂ + 20X₃

Observation    X₂     X₃     100X₂    20X₃     Calculated Values of X₁
Number         (1)    (2)    (3)      (4)      (3) + (4) + 1,000

 1              8      5     800      100      1,900
 2              4      5     400      100      1,500
 3              3     10     300      200      1,500
 4              7      8     700      160      1,860
 5              7     10     700      200      1,900
 6              8     15     800      300      2,100
 7              6     12     600      240      1,840
 8              1     15     100      300      1,400
 9              4     17     400      340      1,740
10              2     22     200      440      1,640
11              4     20     400      400      1,800
12              5     13     500      260      1,760
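The computation in Table 10.1 is mechanical and can be reproduced in a
few lines. The following is a small sketch in Python, assuming the X₂
and X₃ columns as transcribed from the table; the function name
`selling_price` is hypothetical.

```python
# Acres of irrigated land (X2) and of range land (X3) for the twelve
# assumed farms of Table 10.1.
x2 = [8, 4, 3, 7, 7, 8, 6, 1, 4, 2, 4, 5]
x3 = [5, 5, 10, 8, 10, 15, 12, 15, 17, 22, 20, 13]


def selling_price(irrigated_acres, range_acres):
    """Pricing rule of the example: buildings plus land at fixed rates."""
    return 1000 + 100 * irrigated_acres + 20 * range_acres


x1 = [selling_price(a, b) for a, b in zip(x2, x3)]

for number, (a, b, price) in enumerate(zip(x2, x3, x1), start=1):
    print(f"{number:>2}  X2 = {a:>2}  X3 = {b:>2}  X1 = {price:,}")
```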




The apparent relation of the values of X₁, as just computed, to X₂ and
X₃ may be shown by preparing dot charts of the X₁ to X₂ relation and the
X₁ to X₃ relation. These dot charts are shown in Figure 10.2.

Examining this figure, we find that X₁ is fairly closely related to X₂
but that it has no definite relationship to X₃. We could calculate the
regression lines for each of the two relationships shown. The regression
coefficient, B₁₂, for the first comparison, would show the average change
in X₁ with unit changes in X₂. The regression coefficient, B₁₃, for the second
comparison, would show the average change in X₁ with unit changes in
X₃. The latter coefficient would come very close to zero, to judge visually
from the chart. Both these would be gross regression coefficients, measuring
only the apparent relation between X₁ and each of the other variables.
We know in this case that the values of X₁ are completely determined by
the values of X₂ and X₃. If we could hold constant, or eliminate, the
true effect of X₂ on X₁, we should find that the relation of the corrected
values of X₁ to X₃ was just as close as to X₂. In spite of the fact that the
gross regression, B₁₃, appears to be zero, the net regression, b₃, is
really 20.

Fig. 10.2. The apparent relation of farm value to acres of irrigated land and
to range land reveals little of the underlying net relationship. [Two dot
charts: farm value against acres of irrigated land, X₂, and against acres
of range land, X₃.]
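The contrast between gross and net coefficients can be made concrete by
fitting the two simple regressions to the twelve assumed farms. The
sketch below is illustrative only: it assumes the X₂ and X₃ values of
Table 10.1, rebuilds X₁ from the pricing equation, and uses an ordinary
least-squares line; the helper `slope_and_intercept` is a hypothetical
name, not a routine from the text.

```python
# The twelve assumed farms of Table 10.1.
x2 = [8, 4, 3, 7, 7, 8, 6, 1, 4, 2, 4, 5]
x3 = [5, 5, 10, 8, 10, 15, 12, 15, 17, 22, 20, 13]
x1 = [1000 + 100 * a + 20 * b for a, b in zip(x2, x3)]


def slope_and_intercept(x, y):
    """Least-squares straight line y = a + b*x; returns (b, a)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y))
    sxx = sum((u - mean_x) ** 2 for u in x)
    b = sxy / sxx
    return b, mean_y - b * mean_x


# Gross regressions: each independent variable taken alone.
gross_b12, _ = slope_and_intercept(x2, x1)
gross_b13, _ = slope_and_intercept(x3, x1)

print(f"gross B12 = {gross_b12:.2f}  (net b2 is 100)")
print(f"gross B13 = {gross_b13:.2f}  (net b3 is 20)")
```

Because large irrigated areas go with small range areas in these farms,
both gross slopes fall short of the net values, and the one for range
land comes out slightly negative.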





By using the known net regression of X₁ on X₂, we can correct the X₁
values to eliminate that part of their variation which is due to X₂ and then
relate the remaining fluctuation to X₃. Let us do that by subtracting b₂X₂
from X₁. This process is shown in Table 10.2.


Table 10.2

Correction of Computed X₁ for Contribution of X₂

Observation    X₂     X₃     X₁       b₂X₂       X₁ - b₂X₂
Number                                (100X₂)
(1)            (2)    (3)    (4)      (5)        (6)

 1              8      5     1,900    800        1,100
 2              4      5     1,500    400        1,100
 3              3     10     1,500    300        1,200
 4              7      8     1,860    700        1,160
 5              7     10     1,900    700        1,200
 6              8     15     2,100    800        1,300
 7              6     12     1,840    600        1,240
 8              1     15     1,400    100        1,300
 9              4     17     1,740    400        1,340
10              2     22     1,640    200        1,440
11              4     20     1,800    400        1,400
12              5     13     1,760    500        1,260


We can now plot the values of X₁ corrected for X₂, or X₁ - b₂X₂, as
shown in the sixth column, against the X₃ value, as shown in the third
column. The resulting dot chart is shown in Figure 10.3.

Fig. 10.3. After the net influence of irrigated land has been removed, the
underlying relation of farm value to acres of range land is very clear.




This figure now shows the underlying relation between X₁ and X₃,
with all the dots falling exactly on one straight line. If we now draw
in the regression line and calculate its slope, we shall find it is exactly
the same as the line for b₃, which was illustrated in the lower section of
Figure 10.1. Figure 10.3 illustrates the net regression of X₁ on X₃, as
contrasted to the gross regression which was represented by the lower
section of Figure 10.2. If X₁ were similarly corrected for X₃ and the
values X₁ - b₃X₃ were plotted against X₂, the net regression of X₁ on
X₂ would similarly be shown. (This step is left for the student to
perform.)
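The correction just described can be carried out numerically for both
variables. The sketch below assumes the twelve farms of Table 10.1 and
the known net coefficients of 100 and 20; it fits an ordinary
least-squares line to each set of corrected values. The helper name
`slope_and_intercept` is hypothetical.

```python
# The twelve assumed farms of Table 10.1.
x2 = [8, 4, 3, 7, 7, 8, 6, 1, 4, 2, 4, 5]
x3 = [5, 5, 10, 8, 10, 15, 12, 15, 17, 22, 20, 13]
x1 = [1000 + 100 * a + 20 * b for a, b in zip(x2, x3)]


def slope_and_intercept(x, y):
    """Least-squares straight line y = a + b*x; returns (b, a)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y))
    sxx = sum((u - mean_x) ** 2 for u in x)
    b = sxy / sxx
    return b, mean_y - b * mean_x


# Remove the contribution of irrigated land, then relate what is left
# to range land (the sixth column of Table 10.2 against the third).
x1_less_x2 = [price - 100 * a for price, a in zip(x1, x2)]
net_b3, constant_3 = slope_and_intercept(x3, x1_less_x2)

# The companion step: remove range land, then relate to irrigated land.
x1_less_x3 = [price - 20 * b for price, b in zip(x1, x3)]
net_b2, constant_2 = slope_and_intercept(x2, x1_less_x3)

print(f"net b3 = {net_b3:.2f}, constant = {constant_3:.0f}")
print(f"net b2 = {net_b2:.2f}, constant = {constant_2:.0f}")
```

Each corrected series lies exactly on a straight line, so the fitted
slopes recover the net coefficients and the constant recovers the value
of the buildings.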

If we had not known the underlying relationships as given in this case
to start with, but merely had the series of observations of X₁, X₂, and X₃
shown in Table 10.1 and Figure 10.2, would it be possible to work out
from those observations the underlying, or net, relationships? That is
the problem which next will be explored. This time we shall use a series
where we do not know the relationship, and see how we can proceed
to work it out. Also, as in most practical cases, we shall use an example
where all the causes of variation are not known and where we must deal
with independent variables which explain only a part of the variation
in the dependent variable.

Illustrative Example. The problem of multiple relations is illustrated
by the data in Table 10.3. These represent 20 farms in one area, with
varying crop acreages, dairy cows, and incomes.¹ To determine from
these records what income might be expected, on the average and under
the same conditions, with a given size of farm and with a given number of
cows, it is necessary to estimate the effects of differences in the number
of acres and also of differences in the number of cows on income.

¹ The dollar income levels reflect conditions on small to medium-sized dairy farms in
Wisconsin in the late 1920's. In the 1950's, income from farms with these sizes and
numbers of cows would have been roughly twice as large. The regression coefficients
expressing the net effects of acres and cows would also be about twice as large.

Table 10.3

Acres, Number of Cows, and Incomes, for 20 Farms

Record No.    Size of Farm         Size of Dairy        Income
              (number of acres)    (number of cows)     (dollars per year)

 1             60                  18                     960
 2            220                   0                     830
 3            180                  14                   1,260
 4             80                   6                     610
 5            120                   1                     590
 6            100                   9                     900
 7            170                   6                     820
 8            110                  12                     880
 9            160                   7                     860
10            230                   2                     760
11             70                  17                   1,020
12            120                  15                   1,080
13            240                   7                     960
14            160                   0                     700
15             90                  12                     800
16            110                  16                   1,130
17            220                   2                     760
18            110                   6                     740
19            160                  12                     980
20             80                  15                     800

From these data it would seem that both the size of the farm and the
size of the dairy herd influence farm income, to judge from dot charts
showing the relation of income to acres (Figure 10.4) and of income to
number of cows (Figure 10.5). It appears from these charts that there
may be a slight tendency for the farms with the larger acreage in crops
to have larger incomes and a rather marked tendency for the farms with
the larger number of cows to have larger incomes.

Fig. 10.4. Correlation chart of acres and income on individual farms.

Fig. 10.5. Correlation chart of number of cows and income on individual farms.

Analysis by Simple Averages not Adequate. The simple comparison
alone, however, is not sufficient to tell exactly how incomes change with
acres and with number of cows. That is because there is a marked relation
between the size of the farms and the number of cows, as is illustrated in
Figure 10.6. There is a definite tendency in this case for the larger farms
to have smaller dairy herds. As a result, the difference in incomes in
Figure 10.4, which appeared to be directly associated with differences in
acreages, may reflect in part the differences in the sizes of the dairy herds
on the farms with different acreages in crops. If we make groups of farms
of 50 to 99 acres, 100 to 149 acres, and so on, and average the acres, cows,
and incomes for each group, as is shown in Table 10.4, we find a marked
difference in the number of cows from group to group, as well as in the
number of acres and in the incomes.

Fig. 10.6. Correlation chart of number of cows and number of acres
on individual farms. [Horizontal axis: cows, 0 to 20; vertical axis
ticks: 50 to 250.]


Table 10.4

Average Number of Cows and Income, for Farms of Different Sizes

Size Group    Number      Average Size         Average Size of Dairy    Average Income
(acres)       of Farms    (number of acres)    (number of cows)         (number of dollars)

50-99         5            76                  13.6                     838
100-149       6           112                   9.8                     887
150-199       5           166                   7.8                     924
200-249       4           228                   2.8                     828
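The group averages can be rebuilt from the individual records. The
sketch below is an illustration, assuming the 20 farm records as
transcribed from Table 10.3 (acres, cows, income); the helper
`group_means` is a hypothetical name.

```python
# (acres, cows, income in dollars) for the 20 farms of Table 10.3.
farms = [
    (60, 18, 960), (220, 0, 830), (180, 14, 1260), (80, 6, 610),
    (120, 1, 590), (100, 9, 900), (170, 6, 820), (110, 12, 880),
    (160, 7, 860), (230, 2, 760), (70, 17, 1020), (120, 15, 1080),
    (240, 7, 960), (160, 0, 700), (90, 12, 800), (110, 16, 1130),
    (220, 2, 760), (110, 6, 740), (160, 12, 980), (80, 15, 800),
]

size_groups = [(50, 99), (100, 149), (150, 199), (200, 249)]


def group_means(low, high):
    """Count and average acres, cows, income for farms in an acre range."""
    rows = [f for f in farms if low <= f[0] <= high]
    k = len(rows)
    return (k,
            sum(r[0] for r in rows) / k,
            sum(r[1] for r in rows) / k,
            sum(r[2] for r in rows) / k)


table = [group_means(low, high) for low, high in size_groups]

for (low, high), (k, acres, cows, income) in zip(size_groups, table):
    print(f"{low}-{high}: farms = {k}, acres = {acres:.1f}, "
          f"cows = {cows:.2f}, income = {income:.1f}")
```

The unrounded averages make the pattern plain: as the acre groups rise,
the average herd falls, so neither column can be read on its own.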


The farms of 50 to 99 acres, with an average size of 76 acres, have
incomes which average $838; the farms of 150 to 199 acres, with an average
size of 166 acres, show incomes which average $924. Does this difference
in income reflect the difference in size? Before this can be definitely
answered we must consider that the two groups also differ in the average
number of cows, with 13.6 in the first group and only 7.8 in the second.
So far, there is nothing to indicate whether the difference in income is
associated with the difference in the size of the farms or in the number
of cows; we have shown that both vary from group to group, and that
is all.

If, on the other hand, we should attempt to determine how far income
varied with differences in the number of cows by classifying the records
with respect to the number of cows, and averaging incomes, we should
secure the result shown in Table 10.5.


Table 10.5

Average Acres and Income, for Farms with Different Numbers of Cows

Size of Herd    Number      Average Size of Dairy    Average Size of Farms    Average Income
(cows)          of Farms    (number of cows)         (number of acres)        (number of dollars)

Under 5         5            1.0                     190                      728
5-9             6            6.8                     143                      815
10-14           4           12.5                     135                      980
15 and over     5           16.2                      88                      998


Even though the income is higher on the farms with more cows, Table
10.5 does not indicate how much of that can be credited to the cows and
how much to other factors. It is evident from the table that as the number
of cows goes up, the number of acres goes down; are the differences in
income associated with changes in number of cows, in number of acres,
or in part with both?

Eliminating the Approximate Influence of One Variable. What
we need to know is how far income varies with size of farm, for farms with
the same number of cows, and how far income varies with the number
of cows, for farms of the same number of acres. One way of determining
this would be to adjust the income on each farm to remove the part
due to (or associated with) the number of cows, and then compare the
adjusted incomes with the size of the farm to determine the effect of size
on income. To start this process, the effect of the number of cows upon
incomes is needed. We can secure an approximate measure of this
by determining the straight-line equation for estimating incomes from
cows. The measure is approximate only, since the differences in the size
of the farms are ignored at this point.




Determining the straight-line relation according to Chapter 5, we find
that the apparent relation between cows and income is given by the
equation:

Income = 694 + 20.11 × number of cows

According to this equation, farms with no cows averaged about
$694 income, and these incomes increased $20.11 for each cow added,
on the average. Knowing this relation, we can adjust the incomes on the
several farms by deducting that part of the income which would be
assumed to be due to the cows, according to this average relation.

Table 10.6 illustrates the process of adjusting the incomes to a no-cow
basis, by subtracting this approximate effect of cows on incomes. The
next step is to see what the relation is between the acres in the farm and
these adjusted incomes. Plotting both on a dot chart, Figure 10.7, shows
this relation graphically. Comparing this figure with Figure 10.4, where
the relation between the acres and the unadjusted incomes was plotted,
we see that the relation is much closer and more definite for the
adjusted incomes than for the unadjusted incomes. This is only natural;
now that the marked relation of number of cows to income has been
removed, even if only approximately, the underlying relation of size to
income can be more clearly seen.

Table 10.6

Adjusting Farm Incomes for Differences in Number of Cows

Size of Farm    Size of Dairy    Income        Income Assumed    Income Adjusted
                                               Due to Cows       to No-Cow Basis
(number of      (number of       (number of    (number of        (number of
acres)          cows)            dollars)      dollars)          dollars)

 60             18                 960         362               598
220              0                 830           0               830
180             14               1,260         282               978
 80              6                 610         121               489
120              1                 590          20               570
100              9                 900         181               719
170              6                 820         121               699
110             12                 880         241               639
160              7                 860         141               719
230              2                 760          40               720
 70             17               1,020         342               678
120             15               1,080         302               778
240              7                 960         141               819
160              0                 700           0               700
 90             12                 800         241               559
110             16               1,130         322               808
220              2                 760          40               720
110              6                 740         121               619
160             12                 980         241               739
 80             15                 800         302               498

Fig. 10.7. Relation of income adjusted for number of cows, to
number of acres.
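The first step of the adjustment, fitting income to cows and deducting
the part assumed due to cows, can be sketched as follows. This is an
illustration under stated assumptions: the 20 records are taken as
transcribed from Table 10.3, the line is an ordinary least-squares fit,
and the deduction is rounded to whole dollars as Table 10.6 appears to
do. The helper name `slope_and_intercept` is hypothetical.

```python
# (acres, cows, income in dollars) for the 20 farms of Table 10.3.
farms = [
    (60, 18, 960), (220, 0, 830), (180, 14, 1260), (80, 6, 610),
    (120, 1, 590), (100, 9, 900), (170, 6, 820), (110, 12, 880),
    (160, 7, 860), (230, 2, 760), (70, 17, 1020), (120, 15, 1080),
    (240, 7, 960), (160, 0, 700), (90, 12, 800), (110, 16, 1130),
    (220, 2, 760), (110, 6, 740), (160, 12, 980), (80, 15, 800),
]
cows = [f[1] for f in farms]
income = [f[2] for f in farms]


def slope_and_intercept(x, y):
    """Least-squares straight line y = a + b*x; returns (b, a)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y))
    sxx = sum((u - mean_x) ** 2 for u in x)
    b = sxy / sxx
    return b, mean_y - b * mean_x


# Apparent (gross) relation of income to number of cows.
b_cows, a_cows = slope_and_intercept(cows, income)

# Income assumed due to cows, and income adjusted to a no-cow basis.
due_to_cows = [round(b_cows * c) for c in cows]
adjusted = [inc - part for inc, part in zip(income, due_to_cows)]

print(f"Income = {a_cows:.0f} + {b_cows:.2f} x number of cows")
print("first three adjusted incomes:", adjusted[:3])
```

Only the slope is deducted, not the constant, so the adjusted figures
remain on an income scale rather than becoming residuals about zero.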

It is evident from Figure 10.7 that size has a more marked effect upon
income than was apparent in Figure 10.4, where the effect of cows was
mixed in also. As was pointed out earlier, the fact that cows and
acres were correlated meant that the effects of differences in cows were
mixed in with the effects of differences in acres. Now that the effect of
cows has been at least roughly removed, the change in incomes with
changes in acres can be more accurately determined.

Fitting straight lines to the relations shown in Figures 10.4 and 10.7,
to determine the average change in income with changes in acres, we
obtain regression equations as follows:

Income = 868.74 + 0.0234 × number of acres
Income, effect of cows removed = 508.51 + 1.33 × number of acres

(For these early calculations, these values might be rounded off by using
only, say, 3 significant digits in the values for a, b2, and b3.)

It is evident that calculating the effect of acres upon income without
making some allowance for the effect of the correlated variable,
number of cows, would in this case have seriously underestimated the
effect of acres upon income. Such a determination would have shown
only a 0.02 increase in income with each acre increase in size, whereas the
later determination shows a 1.33 increase instead.
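
The two straight-line fits just compared can be checked with a short Python sketch. It is an illustration only, not part of the original computations: the helper name fit_line and the list names are assumptions, and the data are the 20 farms of Table 10.6. Because the sketch subtracts the unrounded 20.11 × cows rather than the rounded table entries, its constant term may differ from 508.51 in the first decimal place.

```python
# Farm data from Table 10.6 (acres, cows, income in dollars).
ACRES = [60, 220, 180, 80, 120, 100, 170, 110, 160, 230,
         70, 120, 240, 160, 90, 110, 220, 110, 160, 80]
COWS = [18, 0, 14, 6, 1, 9, 6, 12, 7, 2,
        17, 15, 7, 0, 12, 16, 2, 6, 12, 15]
INCOME = [960, 830, 1260, 610, 590, 900, 820, 880, 860, 760,
          1020, 1080, 960, 700, 800, 1130, 760, 740, 980, 800]


def fit_line(x, y):
    """Least-squares straight line y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sum_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sum_xx = sum((xi - mean_x) ** 2 for xi in x)
    b = sum_xy / sum_xx
    a = mean_y - b * mean_x
    return a, b


# Income on acres, with no allowance for cows.
a_raw, b_raw = fit_line(ACRES, INCOME)

# Income adjusted to a no-cow basis (20.11 dollars per cow removed), on acres.
adjusted = [y - 20.11 * c for y, c in zip(INCOME, COWS)]
a_adj, b_adj = fit_line(ACRES, adjusted)

print(f"unadjusted: income = {a_raw:.2f} + {b_raw:.4f} x acres")
print(f"adjusted:   income = {a_adj:.2f} + {b_adj:.2f} x acres")
```

The slope on acres rises from about 0.02 to about 1.33 once the approximate effect of cows has been removed, which is the comparison drawn in the text.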




By Successive Elimination 

The relation now shown between income (in dollars) and acres illustrates
the extent to which one variable may really influence a second, even though
its influence is concealed by the presence of a third variable. From Figure
10.4, which indicates that there is practically no correlation between acres
and income, one might conclude that differences in income were not at
all associated with differences in acreages; yet when the variation in income
associated with cows is removed, even by the rough method shown, a
very definite relation of income to size is found. For that reason one
cannot conclude that, because two variables have no correlation, they
are not associated with each other; the lack of correlation may be due to
the compensating influence of one or more other variables, concealing
the real relation.

Eliminating the Approximate Influence of Both Variables. We
now have two equations, one showing the effect of cows upon income
and the other the effect of acres:

(A) Income = 694 + 20.11 × number of cows

(B) Income, effect of cows removed,
        = 508.51 + 1.33 × number of acres

These two equations can be combined into a single equation by taking
that part of the first one which shows the increase in income for each
cow and adding it to the second one. This gives an equation which
includes allowances for both factors, as follows:

(C) Income = 508.51 + 1.33 × number of acres
        + 20.11 × number of cows

The last equation gives a basis for indicating the effect of both acres
and cows on income and for computing the income that might be expected,
on the average, with a farm of a given size and with a given number of
cows. For example, for a farm of 120 acres and 15 cows, the expected
income would work out as follows:

Income = 508.51 + 1.33(120) + 20.11(15)
       = 508.51 + 159.60 + 301.65 = 970

If 5 cows were added, making it 120 acres and 20 cows, the estimated
income would be:

Income = 508.51 + 1.33(120) + 20.11(20)
       = 1,070

Equation (C) can be used as illustrated to work out what income might
be expected for each farm. The estimated income can then be compared
with the actual income and the difference, if any, determined.
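
The arithmetic of equation (C) can be written as a small Python function. This is only a sketch of the worked example above; the function name estimate_income_c is an illustrative assumption, and the coefficients are those of equation (C).

```python
def estimate_income_c(acres, cows):
    """Equation (C): first-approximation estimate of income, in dollars."""
    return 508.51 + 1.33 * acres + 20.11 * cows


est_15 = estimate_income_c(120, 15)   # 120 acres, 15 cows
est_20 = estimate_income_c(120, 20)   # same farm with 5 more cows

print(round(est_15), round(est_20))
```

Each added cow raises the estimate by 20.11 dollars, so the five extra cows account for the whole difference between the two estimates.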



As given in Table 10.7, the estimated incomes still vary somewhat
from the actual. This is just another way of saying that all the differences
in income cannot be accounted for by the effect of differences in acres and
in cows, according to the relations summarized in equation (C). This
failure of the estimated values to agree exactly with the original values
is seen graphically in Figure 10.7 by the fact that all the dots do not lie

Table 10.7

Actual Income and Income Estimated from Number of Acres and Cows

                 Computation of estimated income
                 -------------------------------   Estimated                Actual
                 Estimate for    Estimate for      income,                  income minus
                 acres,          cows,             (A) + (C)     Actual     estimated
Acres    Cows    1.33 × acres    20.11 × cows      + 508.51      income     income
                 (A)             (C)

  60      18          80             362              950.5         960          9.5
 220       0         293               0              801.5         830         28.5
 180      14         239             282            1,029.5       1,260        230.5
  80       6         106             121              735.5         610       -125.5
 120       1         160              20              688.5         590        -98.5
 100       9         133             181              822.5         900         77.5
 170       6         226             121              855.5         820        -35.5
 110      12         146             241              895.5         880        -15.5
 160       7         213             141              862.5         860         -2.5
 230       2         306              40              854.5         760        -94.5
  70      17          93             342              943.5       1,020         76.5
 120      15         160             302              970.5       1,080        109.5
 240       7         319             141              968.5         960         -8.5
 160       0         213               0              721.5         700        -21.5
  90      12         120             241              869.5         800        -69.5
 110      16         146             322              976.5       1,130        153.5
 220       2         293              40              841.5         760        -81.5
 110       6         146             121              775.5         740        -35.5
 160      12         213             241              962.5         980         17.5
  80      15         106             302              916.5         800       -116.5

exactly along the regression line. Subtracting the estimated values from
the actual values gives the residual differences of the actual income above
or below the income estimated from the two factors, acres and cows.

Correcting Results by Successive Elimination. It may now be
recalled that, even though the incomes were adjusted to eliminate the
effects of cows upon income before determining the relation between
income and acres, the determination of the relation between income and
cows was made without making any allowance for the concurrent effect
of acres. Since we now have an approximate measure of the effect of
acres determined while eliminating to some extent the effect of cows, we
can use that new measure, equation (B), to adjust the incomes for the
effect of the acres and then get a more accurate measure of the true effect of
cows alone upon incomes. This process is shown in Table 10.8. Here

Table 10.8

Adjusting Farm Incomes for Differences in Number of Acres

Size of farm   Size of dairy   Income      Income estimated     Income with effects
                                           for acres, with      of acreage differences
                                           no cows              eliminated*
(acres)        (cows)          (dollars)   (dollars)            (dollars)

     60             18             960          588                   372
    220              0             830          801                    29
    180             14           1,260          748                   512
     80              6             610          615                    -5
    120              1             590          669                   -79
    100              9             900          642                   258
    170              6             820          735                    85
    110             12             880          655                   225
    160              7             860          722                   138
    230              2             760          815                   -55
     70             17           1,020          602                   418
    120             15           1,080          669                   411
    240              7             960          828                   132
    160              0             700          722                   -22
     90             12             800          629                   171
    110             16           1,130          655                   475
    220              2             760          802                   -42
    110              6             740          655                    85
    160             12             980          722                   258
     80             15             800          615                   185

* Where the actual income is below that expected for a farm of that size with no cows,
the deficit is indicated by the minus sign.


estimates of income are worked out by equation (B) on the basis of acres,
showing what the incomes might be expected to average if all the farms
had no cows. The difference between these estimates and the actual
incomes may then be considered to be the part due to cows alone, while
eliminating the effect of differences in the numbers of acres. On the first
farm, for example, equation (B) indicates that with no cows the income
for 60 acres should be 588. Subtracting this from the 960 actually received
leaves 372 as the income apparently accompanying the 18 cows.



Fig. 10.8. Relation of income, adjusted for number of acres, to
number of cows.

The adjusted incomes may then be plotted on a dot chart with the
number of cows as the other variable, as shown in Figure 10.8. Comparing
this figure with Figure 10.5, where the number of cows was plotted against
income without first making any adjustment in the original incomes, we
easily see how much closer the relation is after making the adjustment.
Further, it is evident that cows have a greater effect upon income than
was indicated by the earlier comparison. Computing the straight-line
relationship for Figure 10.8 gives the following equation:

(D) Income, adjusted to constant acres,
        = -68.77 + 27.88 × number of cows

By this last computation [equation (D)], each increase of one cow is
accompanied by an average increase in income of 27.88, whereas according
to the earlier comparison [equation (A)], the increase was only 20.11.
The later value is larger than the earlier one, again showing the necessity of
making due allowance for the effect of one factor before the true value of
the other can be properly measured.
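
The step from equation (B) to equation (D) can be reproduced in a few lines of Python. This is a sketch under stated assumptions rather than the original working: the helper fit_line and the list names are illustrative, and the adjustment uses the unrounded values of equation (B) instead of the rounded entries of Table 10.8, so the fitted constants agree with equation (D) only to about the first decimal.

```python
# Farm data from Table 10.8 (acres, cows, income in dollars).
ACRES = [60, 220, 180, 80, 120, 100, 170, 110, 160, 230,
         70, 120, 240, 160, 90, 110, 220, 110, 160, 80]
COWS = [18, 0, 14, 6, 1, 9, 6, 12, 7, 2,
        17, 15, 7, 0, 12, 16, 2, 6, 12, 15]
INCOME = [960, 830, 1260, 610, 590, 900, 820, 880, 860, 760,
          1020, 1080, 960, 700, 800, 1130, 760, 740, 980, 800]


def fit_line(x, y):
    """Least-squares straight line y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    return mean_y - b * mean_x, b


# Income with the acreage effect of equation (B) removed.
adjusted_for_acres = [y - (508.51 + 1.33 * a) for y, a in zip(INCOME, ACRES)]

# Straight line relating the adjusted incomes to number of cows.
a_d, b_d = fit_line(COWS, adjusted_for_acres)
print(f"adjusted income = {a_d:.2f} + {b_d:.2f} x cows")
```

The slope comes out near 27.9 dollars per cow, against 20.11 when no allowance was made for acres.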

Now that we have a new measure of the effect of cows, we might go on
to adjust incomes for cows by this new measure and then get a revised
value for the effect of acres upon incomes on a no-cow basis, in place of
the relation shown in equation (B). This possibility of further correction
will be referred to later. But before that we will make some experiments
with the new equation (D).

We now have equations for the relation of incomes, adjusted for
the other factors, to the remaining factors. These two equations, (B)
and (D), are:

(B) Income, effect of cows removed,
        = 508.51 + 1.33 × number of acres

(D) Income, adjusted to constant acres,
        = -68.77 + 27.88 × number of cows

These two equations may be combined to give a revised equation to
indicate the effect of both cows and acres upon incomes:

(E) Income = 439.74 + 1.33 × number of acres
        + 27.88 × number of cows

Equation (E) is exactly the same as the previous equation (C) except
that the revised effect of cows is included, and the constant term has also
been changed owing to changing the allowance for cows.

In exactly the same way that equation (C) could be used to work out
the estimated income for any given combination of cows and acres,
equation (E) can be used also. Thus for 120 acres and 15 cows, it would
give

Estimated income = 439.7 + 1.33(120) + 27.88(15)
                 = 439.7 + 159.6 + 418.2 = 1,018

The result, 1,018, is 48 higher than the 970 worked out by equation (C). 
This higher estimate is due to the fact that equation (E) makes a larger 
allowance for the effect of each cow, and 15 is more than the average 
number of cows. If less than the average number of cows were used, 
equation (E) would give a lower estimate than equation (C). 

Working out the estimated incomes for each of the original observations 
according to equation (E), we obtain results as shown in Table 10.9. 

Comparing the residuals, or differences between the actual and estimated
income, obtained by means of this new equation with those obtained using
the equation in its first form (shown in Table 10.7), we see that in more
than half the cases they are smaller with the revised form. A more definite
comparison can be made by computing the standard deviation of the
residuals in each case. The standard deviation of the residuals shown in
Table 10.7, using equation (C), is 90.29, whereas the standard deviation
of the residuals shown in Table 10.9, using equation (E), is but 78.70. It
is apparent from this that the revised equation, determined after the effects
of the other variable had been more closely allowed for, gives more
accurate estimates of income than does the original equation in which
the effects of the other variable had not been so fully eliminated.

Table 10.9

Actual Income and Income Estimated from Number of Acres and Number
of Cows, Revised Relations

                 Computation of estimated income
                 -------------------------------   Estimated                Actual
                 Estimate for    Estimate for      income,                  income minus
                 acres,          cows,             (A) + (B)     Actual     estimated
Acres    Cows    1.33 × acres    27.88 × cows      + 439.7       income     income
                 (A)             (B)

  60      18          80             502            1,021.7         960        -61.7
 220       0         293               0              732.7         830         97.3
 180      14         239             390            1,068.7       1,260        191.3
  80       6         106             167              712.7         610       -102.7
 120       1         160              28              627.7         590        -37.7
 100       9         133             251              823.7         900         76.3
 170       6         226             167              832.7         820        -12.7
 110      12         146             335              920.7         880        -40.7
 160       7         213             195              847.7         860         12.3
 230       2         306              56              801.7         760        -41.7
  70      17          93             474            1,006.7       1,020         13.3
 120      15         160             418            1,017.7       1,080         62.3
 240       7         319             195              953.7         960          6.3
 160       0         213               0              652.7         700         47.3
  90      12         120             335              894.7         800        -94.7
 110      16         146             446            1,031.7       1,130         98.3
 220       2         293              56              788.7         760        -28.7
 110       6         146             167              752.7         740        -12.7
 160      12         213             335              987.7         980         -7.7
  80      15         106             418              963.7         800       -163.7


It was suggested previously that the last corrected values for the relation
of cows to income gave a new basis for correcting income so as to measure
more accurately the relation of acres to income. This in turn would give
a new basis for measuring the effect of cows, and so on, until a final stable
value had been reached. So long as a new correction would result in a
further change in the computed effect of either variable, the new values
would give a better basis for estimating income than did the previous
values. Only when the point was reached where no further change need
be made in the effect of either variable could it be said that the relation
of each variable to income had been quite correctly measured while
allowing for the influence of the other factor, and that might involve a
large number of successive corrections.

This method of allowing for the effect of other factors so as to determine
the true relation of each one to the dependent factor (as income, in this
case), by first correcting for one, and then for another, is known as the
method of successive elimination. This method can be used where there
are three or more independent factors related to (or accompanying
variations in) a dependent (or resultant) factor, just as it was used here for two
factors, except that then the dependent needs to be corrected in turn to
eliminate the effects of all the other independent factors except the particular
one whose effect is being measured. But although it is possible to measure
the relations by this method, it would be a very slow and laborious
process. A shorter mathematical method which gives the same result
by more direct processes is available instead. This method, known as the
method of multiple regression, is presented in Chapter 11.
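
What carrying the successive corrections on to a stable value looks like can be sketched in Python. The loop below is an illustration, not the authors' procedure as worked by hand: the names slope, b_acres, b_cows, and history are assumptions, 60 rounds is an arbitrary stopping point chosen to be more than enough for these data, and the adjustments use unrounded values throughout.

```python
# Farm data (acres, cows, income in dollars) as tabulated in this chapter.
ACRES = [60, 220, 180, 80, 120, 100, 170, 110, 160, 230,
         70, 120, 240, 160, 90, 110, 220, 110, 160, 80]
COWS = [18, 0, 14, 6, 1, 9, 6, 12, 7, 2,
        17, 15, 7, 0, 12, 16, 2, 6, 12, 15]
INCOME = [960, 830, 1260, 610, 590, 900, 820, 880, 860, 760,
          1020, 1080, 960, 700, 800, 1130, 760, 740, 980, 800]


def mean(values):
    return sum(values) / len(values)


def slope(x, y):
    """Slope of the least-squares straight line of y on x."""
    mx, my = mean(x), mean(y)
    return (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
            / sum((xi - mx) ** 2 for xi in x))


b_acres, b_cows = 0.0, 0.0   # start with no allowance for either factor
history = []
for _ in range(60):
    # Remove the current acre effect from income, then re-measure cows.
    b_cows = slope(COWS, [y - b_acres * a for y, a in zip(INCOME, ACRES)])
    # Remove the new cow effect from income, then re-measure acres.
    b_acres = slope(ACRES, [y - b_cows * c for y, c in zip(INCOME, COWS)])
    history.append((b_acres, b_cows))

const = mean(INCOME) - b_acres * mean(ACRES) - b_cows * mean(COWS)
print(history[0])                 # first round: about 1.33 per acre, 20.11 per cow
print(const, b_acres, b_cows)     # values after the corrections have settled
```

The first round reproduces the 20.11 and 1.33 of equations (A) and (B); later rounds move both coefficients upward until they stop changing.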

Summary 

This chapter has shown that when two related factors both affect a 
third factor it is difficult to measure the extent to which the dependent 
variable is associated with one independent factor without making allow- 
ances for its relation to the other independent variable. Allowing for 
this duplication by eliminating the effects of each factor in turn (successive 
elimination) can gradually determine the net association with each, but 
the method is long and laborious. 

REFERENCE 

Foote, Richard J., The mathematical basis for the Bean Method of graphic multiple
correlation, Jour. Amer. Stat. Assoc., Vol. 48, pp. 778-88, 1953.



CHAPTER 11


Determining 
multiple regressions: 
(2) by fitting a linear 
regression equation 


In equation (E) in Chapter 10 it was shown that an equation could be
arrived at to express the average relation between income, acres, and
cows, as follows:

Income = 439.74 + 1.33 × number of acres + 27.88 × number of cows

If we designate the three series of variable quantities, income, acres,
and cows, by the symbol X with different subscripts, using $X_1$ to represent
dollars of income, $X_2$ to represent number of acres, and $X_3$ to represent
the number of cows, we can rewrite the equation in the form

    $X_1 = 439.74 + 1.33 X_2 + 27.88 X_3$

If now we use the symbol a to represent the constant quantity 439.74,
$b_2$ to represent 1.33, the amount which $X_1$ increases for each increase of
one unit in $X_2$ (one acre), and $b_3$ to represent 27.88, the amount which
$X_1$ increases for each increase of one unit in $X_3$ (one cow), the equation
appears as

    $X_1 = a + b_2 X_2 + b_3 X_3$        (11.1)

Comparing this equation with the regression equation for the straight-line
relation between two variables

    $Y = a + bX$

we see that the two equations are just alike, except for the difference in
the symbols used to represent the different variables and for our having
added the expression for an additional variable. In equation (11.1), $X_1$,



By Least Squares 

the variable which is being estimated, is termed the dependent variable,
since its estimated value depends upon those of the other variable or
variables; and $X_2$ and $X_3$ are termed independent variables, since their
values are taken just as observed, independent of any of the conditions of
the problem. Since there is more than one independent variable concerned,
the equation is said to be a multiple estimating equation, or a multiple
linear regression equation.

Chapter 10 showed that the values of the constants a, $b_2$, and $b_3$ could
be worked out by a cut-and-try method which gradually approached
nearer and nearer to the right values. For any particular criterion of
"rightness" only one set of values for these constants can be exactly right.
If the criterion of "rightness" is taken as that which will make the standard
deviation of the residuals, when income is estimated from the other two
variables, as small as possible, the values of a, $b_2$, and $b_3$ which will give
this result can be determined by a direct mathematical process, known as
the method of linear multiple regression.

Determining a Regression Equation for Two Independent Variables.
The best values for a, $b_2$, and $b_3$ in the multiple regression equation (11.1)
can be worked out by an extension of the same process used in working
out the values for the estimating equation when only one independent
variable was considered. Just as before, the value of the b constants will
be determined first from equation (11.2) and then the a value will be
worked out from them:¹

    $\Sigma(x_2^2)\,b_2 + \Sigma(x_2 x_3)\,b_3 = \Sigma(x_1 x_2)$
    $\Sigma(x_2 x_3)\,b_2 + \Sigma(x_3^2)\,b_3 = \Sigma(x_1 x_3)$        (11.2)

    $a = M_1 - b_2 M_2 - b_3 M_3$        (11.3)

Here, just as in Chapter 5, the symbol M represents the mean value of
each variable, and the subscript indicates the particular variable.

Similarly, the symbols $\Sigma(x_1 x_2)$, $\Sigma(x_2 x_3)$, and $\Sigma(x_1 x_3)$ represent the sums
of the products of the variables, corrected to adjust them to deviations
from the mean; that is, $\Sigma(x_1 x_2) = \Sigma[(X_1 - M_1)(X_2 - M_2)]$. Likewise
the symbols $\Sigma(x_2^2)$, etc., represent the sums of the squares of the variables
also adjusted to deviations from the mean.

Using the two basic formulas

    $\Sigma(x_1 x_2) = \Sigma(X_1 X_2) - n M_1 M_2$        (5.4)

and

    $\Sigma(x_1^2) = \Sigma(X_1^2) - n(M_1)^2$

¹ These are the normal equations for two independent variables, corresponding to the
normal equations for one independent variable given on p. 62, in the footnote.

the other values shown in equations (11.2) may be worked out as follows:

    $\Sigma(x_2 x_3) = \Sigma(X_2 X_3) - n M_2 M_3$
    $\Sigma(x_1 x_3) = \Sigma(X_1 X_3) - n M_1 M_3$
    $\Sigma(x_3^2) = \Sigma(X_3^2) - n(M_3)^2$

Computing the Extensions. Inspection of these equations shows that
there are eight arithmetic values which must be computed from the
original data to work out the values to substitute in equations (11.2) and
(11.3). These are $\Sigma X_1$, $\Sigma X_2$, $\Sigma X_3$, $\Sigma(X_2^2)$, $\Sigma(X_3^2)$, $\Sigma(X_1 X_2)$, $\Sigma(X_1 X_3)$, and
$\Sigma(X_2 X_3)$. The work of computing these values for the farm-income data
originally presented in Table 10.3 is shown in Table 11.1. [The value

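Table 11.1

Computation of Values to Determine Multiple Regression Equation
to Estimate One Variable from Two Others

                   X2*     X3     X1*      X2^2      X2X3       X1X2      X3^2       X1X3        X1^2

                     6     18      96        36       108        576       324      1,728       9,216
                    22      0      83       484         0      1,826         0          0       6,889
                    18     14     126       324       252      2,268       196      1,764      15,876
                     8      6      61        64        48        488        36        366       3,721
                    12      1      59       144        12        708         1         59       3,481
                    10      9      90       100        90        900        81        810       8,100
                    17      6      82       289       102      1,394        36        492       6,724
                    11     12      88       121       132        968       144      1,056       7,744
                    16      7      86       256       112      1,376        49        602       7,396
                    23      2      76       529        46      1,748         4        152       5,776
                     7     17     102        49       119        714       289      1,734      10,404
                    12     15     108       144       180      1,296       225      1,620      11,664
                    24      7      96       576       168      2,304        49        672       9,216
                    16      0      70       256         0      1,120         0          0       4,900
                     9     12      80        81       108        720       144        960       6,400
                    11     16     113       121       176      1,243       256      1,808      12,769
                    22      2      76       484        44      1,672         4        152       5,776
                    11      6      74       121        66        814        36        444       5,476
                    16     12      98       256       192      1,568       144      1,176       9,604
                     8     15      80        64       120        640       225      1,200       6,400

Sums               279    177   1,744     4,499     2,075     24,343     2,243     16,795     157,532
Means            13.95   8.85    87.2
Adjustment item                        3,892.05  2,469.15  24,328.80  1,566.45  15,434.40  152,076.80
Adjusted sums                            606.95   -394.15      14.20    676.55   1,360.60    5,455.20

* In these computations X1 and X2 have been divided by 10.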

$\Sigma(X_1^2)$ is not needed in solving equations (11.2) or (11.3), but, as it will
be needed later, it is also worked out here for convenience in calculation.]²

After we have multiplied out all the extensions shown in this table, and
added each of the columns, our next step is to compute the values $M_2$,
$M_3$, and $M_1$, by dividing the sums of each of the first three columns by
the number of cases. The adjustment item for each of the products is
then computed and entered below the value from which it is to be
subtracted. Thus the value below the sum of the fourth column, $\Sigma(X_2^2)$, is
its adjustment, $n(M_2)^2$. This is equal to $20(13.95)^2$, or 3,892.05, which is
the value entered. Similarly, the value below the sum of the fifth column,
$\Sigma(X_2 X_3)$, is its adjustment, $n(M_2 M_3)$, or 20(8.85)(13.95), which equals
2,469.15. All the other adjustments are similarly worked out and entered.
Then subtracting each adjustment from the value above it gives the values
all ready for equations (11.2). Thus the value at the foot of column 4
is the value for $\Sigma(x_2^2)$; and so on. When these values are substituted in
the appropriate spaces of equation (11.2), they become

(I)   $\Sigma(x_2^2)\,b_2 + \Sigma(x_2 x_3)\,b_3 = \Sigma(x_1 x_2)$:     $606.95\,b_2 - 394.15\,b_3 = 14.20$

(II)  $\Sigma(x_2 x_3)\,b_2 + \Sigma(x_3^2)\,b_3 = \Sigma(x_1 x_3)$:     $-394.15\,b_2 + 676.55\,b_3 = 1360.60$

² Alternative methods of making the computations and solving the equations, and of
controlling the computations to prevent error, are shown in Appendix 2, pp. 489 to 520.

Solving the Equations. The next step is to solve the two algebraic
equations simultaneously to determine the values for $b_2$ and $b_3$.

One simple way to carry this through is by the Doolittle method. The
first equation is divided by the coefficient of $b_2$, with the sign changed,
giving the first derived equation (I'):

(I)            $606.95\,b_2 - 394.15\,b_3 = 14.20$
(I')           $-b_2 + 0.64939\,b_3 = -0.02340$

Then equation (II) is entered, and under it is written equation (I)
multiplied by the coefficient of $b_3$ in equation (I'), 0.64939. The sum of
these two equations is then taken, eliminating the values in $b_2$:

(II)           $-394.15\,b_2 + 676.55\,b_3 = 1360.60$
(0.64939)(I)   $+394.15\,b_2 - 255.96\,b_3 = 9.22$
(ΣII)          $420.59\,b_3 = 1369.82$
(II')          $b_3 = 3.25690$

As indicated above, this step gives the value of $b_3$. This is then
substituted in equation (I') and the value of $b_2$ determined:

    $-b_2 + 0.64939(3.25690) = -0.02340$
    $b_2 = 0.02340 + 2.11500 = 2.13840$

The values of $b_2$ and $b_3$ being thus obtained, the next step is to substitute
them, together with the other values required, in equation (11.3) to work
out the value for a:

    $a = M_1 - b_2 M_2 - b_3 M_3$
      $= 87.2 - (2.1384)(13.95) - (3.2569)(8.85)$
      $= 87.2 - 29.83 - 28.82 = 28.55$
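
The whole computation, from the extensions through the solution for a, can be retraced in Python. The sketch below is illustrative only: the function corrected_sum and the variable names (s22, s23, and so on) are assumptions, and the elimination follows the same order as the steps above without claiming to be the Doolittle tabular layout itself. Acres and income are divided by 10, as in Table 11.1.

```python
# Farm data; X1 (income) and X2 (acres) are coded by dividing by 10.
ACRES = [60, 220, 180, 80, 120, 100, 170, 110, 160, 230,
         70, 120, 240, 160, 90, 110, 220, 110, 160, 80]
COWS = [18, 0, 14, 6, 1, 9, 6, 12, 7, 2,
        17, 15, 7, 0, 12, 16, 2, 6, 12, 15]
INCOME = [960, 830, 1260, 610, 590, 900, 820, 880, 860, 760,
          1020, 1080, 960, 700, 800, 1130, 760, 740, 980, 800]

X1 = [y / 10 for y in INCOME]
X2 = [a / 10 for a in ACRES]
X3 = list(COWS)
n = len(X1)
M1, M2, M3 = sum(X1) / n, sum(X2) / n, sum(X3) / n


def corrected_sum(u, v):
    """Sum of products minus the adjustment item n * mean(u) * mean(v)."""
    return sum(ui * vi for ui, vi in zip(u, v)) - n * (sum(u) / n) * (sum(v) / n)


s22 = corrected_sum(X2, X2)   # adjusted sum of squares of acres
s23 = corrected_sum(X2, X3)   # adjusted sum of products, acres and cows
s33 = corrected_sum(X3, X3)   # adjusted sum of squares of cows
s12 = corrected_sum(X1, X2)   # adjusted sum of products, income and acres
s13 = corrected_sum(X1, X3)   # adjusted sum of products, income and cows

# Eliminate b2 from equation (II) using equation (I), then back-substitute.
ratio = s23 / s22
b3 = (s13 - ratio * s12) / (s33 - ratio * s23)
b2 = (s12 - s23 * b3) / s22
a = M1 - b2 * M2 - b3 * M3

print(s22, s23, s33, s12, s13)
print(b2, b3, a)
```

The printed sums should agree with the adjusted sums at the foot of Table 11.1, and the coefficients with the values just worked out, apart from rounding in the last decimal.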



Estimating $X_1$ from $X_2$ and $X_3$. Having computed the values for
a, $b_2$, and $b_3$, we can now write out our regression equation (11.1), with
the best values, as determined by the mathematical calculation:

    $X_1/10 = 28.55 + 2.1384\,(X_2/10) + 3.2569\,X_3$

    $X_1 = 285.5 + 2.1384\,X_2 + 32.569\,X_3$

Comparing this equation with the last one obtained in Chapter 10
(page 167), we see that the least squares calculation has changed the
1.33 allowed for the effect of each acre ($b_2$) to 2.14, and increased the
27.88 allowed for the effect of each cow ($b_3$) to 32.57. Just what effect
this has on the accuracy of the equation as a basis for estimating income

Table 11.2

Actual Income and Income Estimated from Number of Acres and Cows,
on Basis of Mathematically Determined Relations

                 Computation of estimated incomes
                 ---------------------------------------------------
                 Estimated     Estimated                                            Actual minus
                 for acres,    for cows,    Constant,    Estimated     Actual       estimated
                                                         income,       income,      income,
  X2      X3     b2X2          b3X3         a            X1'           X1           z = X1 - X1'

  60      18       128           586          286          1,000          960          -40
 220       0       470             0          286            756          830           74
 180      14       385           456          286          1,127        1,260          133
  80       6       171           195          286            652          610          -42
 120       1       257            33          286            576          590           14
 100       9       214           293          286            793          900          107
 170       6       364           195          286            845          820          -25
 110      12       235           391          286            912          880          -32
 160       7       342           228          286            856          860            4
 230       2       492            65          286            843          760          -83
  70      17       150           554          286            990        1,020           30
 120      15       257           489          286          1,032        1,080           48
 240       7       513           228          286          1,027          960          -67
 160       0       342             0          286            628          700           72
  90      12       192           391          286            869          800          -69
 110      16       235           521          286          1,042        1,130           88
 220       2       470            65          286            821          760          -61
 110       6       235           195          286            716          740           24
 160      12       342           391          286          1,019          980          -39
  80      15       171           489          286            946          800         -146


from cows and acres may be judged by working out an estimated income
for each of the 20 cases according to these last results, and then comparing
the estimated values with the original values, just as was done before with
the equations worked out by the approximation method. The necessary
computation is shown in Table 11.2.

The operations that have been performed in this table may be
mathematically stated as follows:

First, an estimated value of income, $X_1'$, has been worked out by
substituting in equation (11.1) the values for $X_2$ and $X_3$ given by each successive
observation. Using the symbol $X_1'$ to represent this estimated value of
$X_1$, it may be defined

    $X_1' = a + b_2 X_2 + b_3 X_3$        (11.4)

Each estimated income has next been subtracted from the corresponding
actual income. With the symbol z used to represent the residual, the
amount by which the actual value exceeds or falls below the estimated
value, it may be defined

    $z = X_1 - X_1'$        (11.5)

The residual z has exactly the same meaning when the estimated values
of the dependent variable are based upon two or more variables, using
multiple regression, as it had previously when the estimate was based on
a single variable, with simple regression.

The accuracy of the last estimating equation, derived by an exact
mathematical process, can now be compared with the accuracy of previous
equations, obtained by a cut-and-try process. Computing the standard
deviation of the residuals shown in this last table and comparing it with
the standard deviations of the residuals worked out in Tables 10.7 and
10.9, we find the comparison to be:

Standard deviations of residuals using various straight-line equations:

First approximation equation,          $s_z = 90.29$
Second approximation equation,         $s_z = 78.70$
Mathematically determined equation,    $s_z = 70.48$
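
The comparison of the three equations can be repeated with a short Python sketch. It is offered as an illustration under two assumptions: that the standard deviation is taken about the mean of the residuals with divisor n (the standard library's pstdev), and that unrounded estimates are used, so the first two figures may differ from 90.29 and 78.70 by a few tenths. The function name residuals is hypothetical.

```python
from statistics import pstdev

# Farm data (acres, cows, income in dollars).
ACRES = [60, 220, 180, 80, 120, 100, 170, 110, 160, 230,
         70, 120, 240, 160, 90, 110, 220, 110, 160, 80]
COWS = [18, 0, 14, 6, 1, 9, 6, 12, 7, 2,
        17, 15, 7, 0, 12, 16, 2, 6, 12, 15]
INCOME = [960, 830, 1260, 610, 590, 900, 820, 880, 860, 760,
          1020, 1080, 960, 700, 800, 1130, 760, 740, 980, 800]


def residuals(const, b_acres, b_cows):
    """Actual income minus income estimated from acres and cows."""
    return [y - (const + b_acres * a + b_cows * c)
            for a, c, y in zip(ACRES, COWS, INCOME)]


s_c = pstdev(residuals(508.51, 1.33, 20.11))     # equation (C)
s_e = pstdev(residuals(439.74, 1.33, 27.88))     # equation (E)
s_ls = pstdev(residuals(285.5, 2.1384, 32.569))  # least squares equation

print(round(s_c, 2), round(s_e, 2), round(s_ls, 2))
```

The three values fall in the same order as in the comparison above, with the least squares equation giving the smallest standard deviation of residuals.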

The equation determined mathematically gives closer estimates of the
actual incomes from which it was derived than do either of the two previous
equations. This will always hold true. The mathematically determined
equation gives once and for all the estimates of $X_1$ which will make $s_z$
the smallest that can be obtained, assuming linear relations. The best that
could be done by the approximation method would be to obtain the same
conclusions as would be obtained by the other method. The successive
steps in Chapter 10 have shown how difficult it is to do this when the
several independent variables are correlated with each other, and so tend
to vary with one another. The mathematical method for determining the
estimating equation, as illustrated in this chapter (or some alternative
form of computation involving the same principle), has therefore been
practically universally adopted as the standard way of determining the
precise way in which one variable is related to, or may be estimated from,
two or more variables related among themselves, if only straight-line
relations are to be assumed.

Nomenclature in Multiple Linear Regression. When the constants
of the estimating equation are determined by the exact mathematical
process, the equation is called a multiple regression equation, and the
constants $b_2$ and $b_3$, which show, in this case, the average increase in
income ($X_1$) associated with unit increases in acres ($X_2$) and cows ($X_3$), are
termed net regression coefficients. The constant $b_2$ is termed the net
regression of $X_1$ on $X_2$, holding $X_3$ constant, and $b_3$ is termed the net
regression of $X_1$ on $X_3$, holding $X_2$ constant. All that that means for $b_2$, for example,
is "the average change observed in $X_1$ with unit changes in $X_2$, determined
while simultaneously eliminating from $X_1$ any variation accompanying
(hence temporarily assumed due to) changes in $X_3$."³

In order that the mathematical notation for the net regression
coefficients may show quite clearly which independent variables were held
constant when a particular coefficient was determined, the subscripts under
the b are sometimes more elaborate, showing first the dependent variable,
then the independent variable whose effect is stated, then a period followed
by the independent variables which were held constant in the process.
Thus the $b_2$ we have been using would be written $b_{12.3}$. The whole
regression equation would appear

    $X_1 = a_{1.23} + b_{12.3} X_2 + b_{13.2} X_3$        (11.6)

This notation serves to distinguish these net regression coefficients from
those which would be obtained if additional independent variables were
included. Thus if a third independent variable, say $X_4$, were also
considered, the equation would read

    $X_1 = a_{1.234} + b_{12.34} X_2 + b_{13.24} X_3 + b_{14.23} X_4$        (11.7)

For still another variable, $X_5$, it would read

    $X_1 = a_{1.2345} + b_{12.345} X_2 + b_{13.245} X_3 + b_{14.235} X_4 + b_{15.234} X_5$        (11.8)

The notation for a is changed as well as for each of the b's; $a_{1.234}$ will
probably be a different value from $a_{1.23}$, just as $b_{12.34}$ is likely to be
somewhat different from $b_{12.3}$.

³ The term partial regression coefficient is used by some authors in place of net
regression coefficient.


Determining a Regression Equation for Three Independent Variables.
Solely to illustrate the method, we may take the number of men on each
of these 20 farms as given in Table 11.3 and work out an estimating
equation considering men as well as acres and cows. (In practice, 20
observations are often too few to determine, with a satisfactory degree of
reliability, the net relations of one variable to three independent variables.)

Table 11.3

Computation of Additional Values to Determine Multiple Regression
Equation, Adding a Third Independent Factor

Item        Acres,*    Cows,    Men,    Dollars
number                                  income,*
            X2         X3       X4      X1          X2X4      X3X4       X1X4      X4^2

  1            6        18       2         96         12        36        192         4
  2           22         0       3         83         66         0        249         9
  3           18        14       4        126         72        56        504        16
  4            8         6       1         61          8         6         61         1
  5           12         1       1         59         12         1         59         1
  6           10         9       1         90         10         9         90         1
  7           17         6       3         82         51        18        246         9
  8           11        12       2         88         22        24        176         4
  9           16         7       2         86         32        14        172         4
 10           23         2       3         76         69         6        228         9
 11            7        17       2        102         14        34        204         4
 12           12        15       3        108         36        45        324         9
 13           24         7       4         96         96        28        384        16
 14           16         0       2         70         32         0        140         4
 15            9        12       1         80          9        12         80         1
 16           11        16       3        113         33        48        339         9
 17           22         2       2         76         44         4        152         4
 18           11         6       1         74         11         6         74         1
 19           16        12       2         98         32        24        196         4
 20            8        15       2         80         16        30        160         4

Sums         279       177      44      1,744        677       401      4,030    114.00
Means      13.95      8.85     2.2       87.2
Adjustment items                                  613.80    389.40   3,836.80     96.80
Adjusted sums                                      63.20     11.60     193.20     17.20

* Coded by dividing by 10.


With the number of men designated as $X_4$, the unknown constants
to be determined are those given in equation (11.7), $a_{1.234}$, $b_{12.34}$, $b_{13.24}$,
and $b_{14.23}$. They can be obtained by the solution of the following set of
equations:

    $\Sigma(x_2^2)\,b_{12.34} + \Sigma(x_2 x_3)\,b_{13.24} + \Sigma(x_2 x_4)\,b_{14.23} = \Sigma(x_1 x_2)$
    $\Sigma(x_2 x_3)\,b_{12.34} + \Sigma(x_3^2)\,b_{13.24} + \Sigma(x_3 x_4)\,b_{14.23} = \Sigma(x_1 x_3)$        (11.9)
    $\Sigma(x_2 x_4)\,b_{12.34} + \Sigma(x_3 x_4)\,b_{13.24} + \Sigma(x_4^2)\,b_{14.23} = \Sigma(x_1 x_4)$

    $a_{1.234} = M_1 - b_{12.34} M_2 - b_{13.24} M_3 - b_{14.23} M_4$        (11.10)
Computing the Extensions. All except 4 of the arithmetic values for
equations (11.9) which need to be calculated from the original data have
been worked out previously. Only the values which involve $X_4$, and its
mean, are additional. The new values needed are therefore $M_4$, $\Sigma(x_1 x_4)$,
$\Sigma(x_2 x_4)$, $\Sigma(x_3 x_4)$, and $\Sigma(x_4^2)$. The computation of these values is shown
in Table 11.3.

All the calculations, including correcting for the means at the end, are
carried out just as in Table 11.1. The figures at the foot of each column
provide the remaining values necessary to write out equations (11.9)
in full. For convenience in writing these equations, we shall again use
the abridged notation of $b_2$ for $b_{12.34}$, $b_3$ for $b_{13.24}$, and $b_4$ for $b_{14.23}$,
remembering, however, that $b_2$ here is a different constant from previously.

(I)    $\Sigma(x_2^2)\,b_2 + \Sigma(x_2 x_3)\,b_3 + \Sigma(x_2 x_4)\,b_4 = \Sigma(x_1 x_2)$:
       $606.95\,b_2 - 394.15\,b_3 + 63.20\,b_4 = 14.20$

(II)   $\Sigma(x_2 x_3)\,b_2 + \Sigma(x_3^2)\,b_3 + \Sigma(x_3 x_4)\,b_4 = \Sigma(x_1 x_3)$:
       $-394.15\,b_2 + 676.55\,b_3 + 11.60\,b_4 = 1360.60$

(III)  $\Sigma(x_2 x_4)\,b_2 + \Sigma(x_3 x_4)\,b_3 + \Sigma(x_4^2)\,b_4 = \Sigma(x_1 x_4)$:
       $63.20\,b_2 + 11.60\,b_3 + 17.20\,b_4 = 193.20$

Solving the Equations. The three equations are now to be solved
simultaneously to determine the values for $b_2$, $b_3$, and $b_4$. This can be
done by the usual algebraic processes, but the peculiar symmetrical
character of the equations, which the attentive reader has probably already
noticed, makes it possible to use a much shorter method. Since the saving
in clerical labor by the use of this method is quite significant, it will be
shown in full.

The first step is to set down the first equation (I) and divide it by the
coefficient of the first term, $\Sigma x_2^2$, with the sign changed, or $-606.95$ in
this case. The resulting derived equation (I') is set down just below it:

$$
\begin{array}{lrrrcr}
(\mathrm{I}) & 606.95b_2 & -\,394.15b_3 & +\,63.20b_4 & = & 14.20 \\
(\mathrm{I}') & -b_2 & +\,0.64939b_3 & -\,0.10413b_4 & = & -0.02340
\end{array}
$$




The next step is to set down the second equation (II). The first equation
(I) is then multiplied by the coefficient of the second term in the derived
equation (I'), which is $+0.64939$ in this case, and the products set down
just below equation (II). These two equations are added, giving the
sum equation ($\Sigma_2$), which cancels out the first term, as shown below.
The sum equation is then divided by the coefficient of its first term, with
the sign changed, giving the second derived equation (II'). The second
portion of the work now appears as follows:

$$
\begin{array}{lrrrcr}
(\mathrm{II}) & -394.15b_2 & +\,676.55b_3 & +\,11.60b_4 & = & 1360.60 \\
(0.64939)(\mathrm{I}) & 394.15b_2 & -\,255.96b_3 & +\,41.04b_4 & = & 9.22 \\
(\Sigma_2) & & 420.59b_3 & +\,52.64b_4 & = & 1369.82 \\
(\mathrm{II}') & & -b_3 & -\,0.12516b_4 & = & -3.25690
\end{array}
$$

The final step in the process of elimination is to write down equation
(III), multiply the first equation (I) by the coefficient of the third term of
the first derived equation (I'), which is $-0.10413$ in this case, and set
the products down below equation (III); multiply the sum equation ($\Sigma_2$)
by the corresponding coefficient (the second term) from the second derived
equation (II'), $-0.12516$; and set these products down below the previous
equation. Equation (III) and the two new equations are then added,
giving an equation ($\Sigma_3$), from which values in both $b_2$ and $b_3$ have been
eliminated. This equation is then divided by the coefficient of its first
term, with the sign changed, $-4.03$ in this case, and the resulting new
derived equation entered as equation (III'). (A method of checking each
step in these computations is shown in Appendix 2, p. 493.) All the
computations to this point are:


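$$
\begin{array}{lrrrcr}
(\mathrm{I}) & 606.95b_2 & -\,394.15b_3 & +\,63.20b_4 & = & 14.20 \\
(\mathrm{I}') & -b_2 & +\,0.64939b_3 & -\,0.10413b_4 & = & -0.02340 \\[6pt]
(\mathrm{II}) & -394.15b_2 & +\,676.55b_3 & +\,11.60b_4 & = & 1360.60 \\
(0.64939)(\mathrm{I}) & 394.15b_2 & -\,255.96b_3 & +\,41.04b_4 & = & 9.22 \\
(\Sigma_2) & & 420.59b_3 & +\,52.64b_4 & = & 1369.82 \\
(\mathrm{II}') & & -b_3 & -\,0.12516b_4 & = & -3.25690 \\[6pt]
(\mathrm{III}) & 63.20b_2 & +\,11.60b_3 & +\,17.20b_4 & = & 193.20 \\
(-0.10413)(\mathrm{I}) & -63.20b_2 & +\,41.04b_3 & -\,6.58b_4 & = & -1.48 \\
(-0.12516)(\Sigma_2) & & -\,52.64b_3 & -\,6.59b_4 & = & -171.45 \\
(\Sigma_3) & & & 4.03b_4 & = & 20.27 \\
(\mathrm{III}') & & & -b_4 & = & -5.02978
\end{array}
$$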
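The elimination just described can be sketched in a few lines of code. The routine below is an illustration under stated assumptions rather than the book's own computing form: the function name `forward_eliminate` and the list layout are hypothetical, and the shortcut of taking each multiplier from an earlier derived equation is valid only because the coefficients are symmetrical, as the text points out. Since the code carries full precision, its figures may differ from the hand computation in the last decimal place.

```python
def forward_eliminate(rows):
    """Shortened elimination for a symmetric set of normal equations.

    rows[i] holds the coefficients of equation i followed by its constant term.
    Returns (sums, derived): the sum equations (I), (Sigma_2), (Sigma_3), ...
    and the derived equations (I'), (II'), (III'), ...
    """
    sums, derived = [], []
    for i, row in enumerate(rows):
        current = list(row)
        for j in range(i):
            # Multiplier = coefficient of unknown i in the j-th derived equation.
            multiplier = derived[j][i]
            current = [c + multiplier * s for c, s in zip(current, sums[j])]
        sums.append(current)
        pivot = current[i]
        # Divide by the leading coefficient with the sign changed.
        derived.append([-c / pivot for c in current])
    return sums, derived


# Equations (I), (II), (III): coefficients of b2, b3, b4, then the constant.
equations = [
    [606.95, -394.15, 63.20, 14.20],
    [-394.15, 676.55, 11.60, 1360.60],
    [63.20, 11.60, 17.20, 193.20],
]
sums, derived = forward_eliminate(equations)
for label, row in zip(["I'", "II'", "III'"], derived):
    print(label, [round(v, 5) for v in row])
```

Each derived row lists the coefficients of $b_2$, $b_3$, $b_4$ and the constant, so the last row corresponds to equation (III').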




It is now very easy to compute the values of $b_2$, $b_3$, and $b_4$ from the
three derived equations. From equation (III'), $b_4 = 5.02978$.
Substituting this value in equation (II'), which may be transposed to
read

$$b_3 = 3.25690 - 0.12516b_4$$

we find

$$b_3 = 3.25690 - (0.12516)(5.02978) = 3.25690 - 0.62953 = 2.62737$$

Then, transposing equation (I'), we find

$$b_2 = 0.02340 + 0.64939b_3 - 0.10413b_4,$$

and substituting the values for $b_3$ and $b_4$,

$$b_2 = 0.02340 + (1.70619) - (0.52375),$$

we find

$$b_2 = 1.20584$$

The values of $b_2$, $b_3$, and $b_4$, just computed, may next be verified by
substituting them in the last equation (III). Equations (I) or (II) should
not be used for this verification, since they will not provide a complete
check. Equation (III) is $63.20b_2 + 11.60b_3 + 17.20b_4 = 193.20$, which
becomes, when the newly calculated values are substituted,

$$(63.20)(1.20584) + (11.60)(2.62737) + (17.20)(5.02978) = 193.20$$

This works out to $76.21 + 30.48 + 86.51 = 193.20$, or $193.20 = 193.20$.
This proves the accuracy of all the previous work.

The work just summarized is all that is needed to solve these three
simultaneous equations. In view of the way the terms cancel out during
the second and subsequent steps of the process, the work can be still
further simplified by omitting all entries to the left of the solid line
drawn through the last full set of computations in the printed form, that
is, the $b_2$ terms in the second step and the $b_2$ and $b_3$ terms in the third.

Having calculated the values of the three $b$'s, we can calculate $a$ very
readily:

$$
\begin{aligned}
a &= M_1 - b_2M_2 - b_3M_3 - b_4M_4 \\
  &= 87.2 - (1.20584)(13.95) - (2.62737)(8.85) - (5.02978)(2.20) \\
  &= 36.06
\end{aligned}
$$
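The back solution, the check in equation (III), and the constant $a$ can be reproduced as follows. This is a minimal sketch that simply takes the rounded coefficients of the derived equations as printed in the text; the variable names are illustrative.

```python
# Derived equations (III'), (II'), (I') transposed, with coefficients from the text.
b4 = 5.02978
b3 = 3.25690 - 0.12516 * b4
b2 = 0.02340 + 0.64939 * b3 - 0.10413 * b4

# Check by substitution in equation (III); it should reproduce 193.20.
check_iii = 63.20 * b2 + 11.60 * b3 + 17.20 * b4

# Constant term from the means M1, M2, M3, M4 of Table 11.3 (coded units).
a = 87.2 - b2 * 13.95 - b3 * 8.85 - b4 * 2.20

print(f"b2 = {b2:.5f}, b3 = {b3:.5f}, b4 = {b4:.5f}")
print(f"check (III) = {check_iii:.2f}, a = {a:.2f}")
```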




The regression equation for the three variables, in terms of the coded
values, is therefore

$$X_1 = 36.06 + 1.20584X_2 + 2.62737X_3 + 5.02978X_4 \tag{11.11}$$

If we clear the fractions, the equation becomes

$$X_1 = 360.60 + 1.20584X_2 + 26.2737X_3 + 50.2978X_4$$
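The step of clearing the fractions may be written out as follows, on the assumption (taken from the note to Table 11.3) that both income and acres were coded by dividing by 10, while cows and men were left uncoded:

```latex
\begin{aligned}
\frac{X_1}{10} &= 36.06 + 1.20584\,\frac{X_2}{10} + 2.62737\,X_3 + 5.02978\,X_4 \\
X_1 &= 360.60 + 1.20584\,X_2 + 26.2737\,X_3 + 50.2978\,X_4
\end{aligned}
```

Multiplying through by 10 leaves the coefficient for acres unchanged, since the two codings offset each other, but multiplies the constant and the coefficients for cows and men by 10.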

Using this equation, we may work out values of $X_1'$ and of $z$ just as
we did previously. (This will be left as an exercise for the student. Is $s_z$
for the new estimates larger or smaller than for the previous estimates?
Why should it be?)

Interpreting Net Regression Coefficients. It should be noted that
though the value of 1.20584 for $b_{12.34}$, just determined, compares with the
value of 2.13840 for $b_{12.3}$ determined previously, they do not measure
exactly the same thing. The coefficient $b_{12.34}$ shows the average increase
in income for each acre increase in size of farm, with both the number
of cows and the number of men remaining unchanged. The coefficient
$b_{12.3}$ shows the average increase in income for each increase of one acre
in size, with the number of cows remaining unchanged, but without making
any allowance for differences in the number of men. Apparently a
considerable portion of the differences in income which on the earlier 
analysis would have been ascribed to the additional acreage is shown by 
this more complete analysis really to have been associated with the larger 
labor force on the greater acreages, rather than to the greater acreages 
themselves. This result illustrates one property of net regression coeffi- 
cients in common with all other regression results. They ascribe to any 
particular independent variable not only the variation in the dependent 
variable which is directly due to that independent variable but also the 
variation which is due to such other independent variables correlated 
with it as have not been separately considered in the study. In the same 
way that acres, taken alone, included part of the effect due to cows, the 
effect of acres eliminating cows still included part of the effect due to 
men; and even the effect of acres holding constant the effect of both 
cows and men may still include variation due to other variables correlated 
with them, such, for example, as fertility of the land. These considerations 
illustrate the extreme care which is necessary in examination of the data 
and the theoretical analysis of the problem before deciding on the variables 
to be correlated, and the caution which must be employed in interpreting 
the results. 
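The change in the acreage coefficient when men are added can be reproduced from the adjusted sums alone. The sketch below solves the normal equations once with two and once with three independent variables. It uses ordinary Gaussian elimination in place of the shortened clerical method, and the function name `solve` is a hypothetical helper, so this is an illustration rather than the book's procedure.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            aug[r] = [x - factor * p for x, p in zip(aug[r], aug[col])]
    solution = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (aug[i][n] - tail) / aug[i][i]
    return solution


# Adjusted sums of squares and products for acres, cows, men (coded units).
moments = [
    [606.95, -394.15, 63.20],
    [-394.15, 676.55, 11.60],
    [63.20, 11.60, 17.20],
]
cross_with_income = [14.20, 1360.60, 193.20]

# Acres and cows only: b12.3 and b13.2.
two_var = solve([row[:2] for row in moments[:2]], cross_with_income[:2])
# Acres, cows, and men: b12.34, b13.24, b14.23.
three_var = solve(moments, cross_with_income)

print("b12.3  =", round(two_var[0], 4))
print("b12.34 =", round(three_var[0], 4))
```

The acreage coefficient falls from about 2.14 to about 1.21 once the number of men is allowed for, which is the shift discussed above.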

Determining the Regression Equation for Any Number of Independent
Variables. The same mathematical principle which has been used to
determine the constants for regression equations involving one, two, or
three independent variables can be extended to problems involving any
number of variables it may be desired to employ.

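For four independent variables the equations are

$$
\begin{aligned}
\Sigma(x_2^2)\,b_{12.345} + \Sigma(x_2x_3)\,b_{13.245} + \Sigma(x_2x_4)\,b_{14.235} + \Sigma(x_2x_5)\,b_{15.234} &= \Sigma(x_1x_2) \\
\Sigma(x_2x_3)\,b_{12.345} + \Sigma(x_3^2)\,b_{13.245} + \Sigma(x_3x_4)\,b_{14.235} + \Sigma(x_3x_5)\,b_{15.234} &= \Sigma(x_1x_3) \\
\Sigma(x_2x_4)\,b_{12.345} + \Sigma(x_3x_4)\,b_{13.245} + \Sigma(x_4^2)\,b_{14.235} + \Sigma(x_4x_5)\,b_{15.234} &= \Sigma(x_1x_4) \\
\Sigma(x_2x_5)\,b_{12.345} + \Sigma(x_3x_5)\,b_{13.245} + \Sigma(x_4x_5)\,b_{14.235} + \Sigma(x_5^2)\,b_{15.234} &= \Sigma(x_1x_5)
\end{aligned}
$$

$$
a_{1.2345} = M_1 - b_{12.345}M_2 - b_{13.245}M_3 - b_{14.235}M_4 - b_{15.234}M_5 \tag{11.12}
$$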

When this set of equations is compared with equations (11.9) for three
independent variables, it is evident that adding the additional variable,
$X_5$, has made it necessary to add the additional equation, in which $x_5$
appears in each of the product terms, and also to add an additional term
to each of the previous equations, the additional term including a product
summation [such as $\Sigma(x_2x_5)$ and $\Sigma(x_3x_5)$] in which $x_5$ appears, and also
the net regression coefficient $b_{15.234}$. The equation to compute $a$ has also
been extended by adding the term $-b_{15.234}M_5$. In the same way the equations
to be solved to determine the constants for any number of variables
can be built up, if it is remembered that for each variable added a new
term must be added to each of the previous equations and a new equation
must be added, each term added including the new variable in some way.

The products which must be computed for any given set of variables,
and the equations which will need to be solved, may be worked out readily
by the use of the following scheme.

Write out the required regression equation (in terms of deviations from
the mean) as, for example, for six variables:

$$b_2x_2 + b_3x_3 + b_4x_4 + b_5x_5 + b_6x_6 = x_1$$

Multiply each term by the coefficient of the first unknown (that is,
by $x_2$) and sum. This gives the first of the required equations:

$$\Sigma(x_2^2)b_2 + \Sigma(x_2x_3)b_3 + \Sigma(x_2x_4)b_4 + \Sigma(x_2x_5)b_5 + \Sigma(x_2x_6)b_6 = \Sigma(x_1x_2)$$

Then multiply by the coefficient of the second unknown ($x_3$) and sum.
The second equation is, therefore,

$$\Sigma(x_2x_3)b_2 + \Sigma(x_3^2)b_3 + \Sigma(x_3x_4)b_4 + \Sigma(x_3x_5)b_5 + \Sigma(x_3x_6)b_6 = \Sigma(x_1x_3)$$





The same process is carried out for the coefficient of each unknown in
turn, giving five equations to be solved simultaneously to determine the
values for the five unknowns. Setting up these equations may be reduced
to a tabular form, as in Table 11.4.

The variables to be considered are listed at the head of columns from
the left to right, ending with the dependent variable at the right. Then
the independent variables are entered down the beginning of the lines
at the left in the same order. The cells of the table are then filled by
multiplying the variable at the head of the column by the variable at the
left end of the line. These products indicate the values to be computed [by
equations (5.2) and (6.4)], to give the arithmetic values for the equations.
The $b$ terms represent, of course, the net regression coefficients for the
particular number of variables concerned; that is, $b_2$ would be $b_{12.3}$ for
two independent variables, $b_{12.34}$ for three independent variables, and
so on. The illustration is carried out to seven independent variables,
but the scheme can be extended to as many as it is desired to consider.

Table 11.4

Form for Working Out the Equations to Derive
Net Regression Constants

(Independent variables, in deviations from the mean, head the lines and the
columns; the dependent variable heads the last column.)

$$
\begin{array}{c|ccccc|c}
 & x_2 & x_3 & x_4 & \cdots & x_8 & x_1 \\ \hline
x_2 & \Sigma(x_2^2)b_2 & \Sigma(x_2x_3)b_3 & \Sigma(x_2x_4)b_4 & \cdots & \Sigma(x_2x_8)b_8 & = \Sigma(x_1x_2) \\
x_3 & \Sigma(x_2x_3)b_2 & \Sigma(x_3^2)b_3 & \Sigma(x_3x_4)b_4 & \cdots & \Sigma(x_3x_8)b_8 & = \Sigma(x_1x_3) \\
x_4 & \Sigma(x_2x_4)b_2 & \Sigma(x_3x_4)b_3 & \Sigma(x_4^2)b_4 & \cdots & \Sigma(x_4x_8)b_8 & = \Sigma(x_1x_4) \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
x_8 & \Sigma(x_2x_8)b_2 & \Sigma(x_3x_8)b_3 & \Sigma(x_4x_8)b_4 & \cdots & \Sigma(x_8^2)b_8 & = \Sigma(x_1x_8)
\end{array}
$$
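The tabular scheme lends itself to a short routine that builds the whole set of equations for any number of independent variables. The sketch below is illustrative only; the function name `normal_equations` is a hypothetical helper. It forms every cell as a sum of products of deviations from the means, and is tried here on the coded farm data of Table 11.3.

```python
def normal_equations(dependent, independents):
    """Return (matrix, rhs): matrix[i][j] = sum(x_i * x_j), rhs[i] = sum(x_1 * x_i),
    where every variable is taken as a deviation from its own mean."""
    def deviations(values):
        mean = sum(values) / len(values)
        return [v - mean for v in values]

    y = deviations(dependent)
    xs = [deviations(column) for column in independents]
    matrix = [[sum(a * b for a, b in zip(xi, xj)) for xj in xs] for xi in xs]
    rhs = [sum(a * b for a, b in zip(xi, y)) for xi in xs]
    return matrix, rhs


# Coded farm data (acres and income divided by 10), as in Table 11.3.
acres  = [6, 22, 18, 8, 12, 10, 17, 11, 16, 23, 7, 12, 24, 16, 9, 11, 22, 11, 16, 8]
cows   = [18, 0, 14, 6, 1, 9, 6, 12, 7, 2, 17, 15, 7, 0, 12, 16, 2, 6, 12, 15]
men    = [2, 3, 4, 1, 1, 1, 3, 2, 2, 3, 2, 3, 4, 2, 1, 3, 2, 1, 2, 2]
income = [96, 83, 126, 61, 59, 90, 82, 88, 86, 76, 102, 108, 96, 70, 80, 113, 76, 74, 98, 80]

matrix, rhs = normal_equations(income, [acres, cows, men])
for row, constant in zip(matrix, rhs):
    print([round(v, 2) for v in row], "=", round(constant, 2))
```

Adding a further independent variable only means passing one more column; the routine then adds one line and one column to the table, just as the scheme requires.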

The equation to compute a is simply the value of the mean of the 
dependent variable, minus the product of the mean of each independent 
variable multiplied by the coefficient for the net regression of the dependent 
variable on that independent variable. 

As a matter of practical procedure, it is seldom that a problem is so
complicated or that enough observations are available so that significant
results for each variable will be obtained using ten or more variables;
and, ordinarily, analyses involving not more than five variables are all
that will yield stable results. Various methods for simplifying the necessary
calculations in carrying through a problem involving a large number of
observations are presented in Appendix 2.

Interpreting the Multiple Regression Equation. The same limitations
apply in interpreting regression coefficients worked out with the effect
of one or more variables held constant as when only two variables are
considered. Thus for the data shown in Table 11.3 there were no
observations with more than 18 cows, or 4 men, and none below 60 acres or
above 240 acres. For that reason, there is no basis for using the regression
equation to estimate income beyond those limits. Furthermore, for the
extreme ranges where only a few observations were available (for example,
less than 80 acres) the relations could not be expected to hold as well as
where there were more observations upon which to base the conclusions.
In Chapter 17 a more definite basis for determining the probable accuracy
of such estimates is discussed, together with ways of working out the
confidence intervals for each constant which appears in the regression
equation. For the present the caution may be restated, that the results
may be expected to hold true only within the range covered by the bulk
of the observations upon which they were based.*

The meaning of the regression equation

$$X_1 = 360.60 + 1.21X_2 + 26.27X_3 + 50.30X_4$$

may be made clearer, in publishing correlation results, by working out
the estimated values for a representative variety of conditions. Such a
statement of the conclusions covered by the previous regression equation
would be made as in Table 11.5.

Table 11.5

Average Income on Farms With Varying Numbers of Acres, Cows,
and Men

(As indicated by regression analysis)

Labor      Income on Farms with 100 Acres      Income on Farms with 160 Acres
Force      0 Cows      8 Cows     16 Cows      0 Cows      8 Cows     16 Cows
(men)     (dollars)   (dollars)  (dollars)    (dollars)   (dollars)  (dollars)

  1          532         742        952           *           *          *
  2           *          792      1,003          655         865         *
  3           *           *       1,053          705         915       1,125

* Omitted because of absence of observations representing this combination of factors.


* Even within the limits of the range of observations of each variable taken separately,
there may be combinations of values of independent variables which are not represented
by the data, either exactly or even approximately. Estimates for such combinations
will have less reliability than for those combinations which are represented. For a fuller
discussion of this source of unreliability, see Chapter 19.




It should be noted in Table 11.5 that, according to these results,
increasing the number of men from 1 to 2, or from 2 to 3, will add 50 dollars
to income, no matter whether the farm has 100 acres and 8 cows, or 160 acres and
16 cows. Similarly, adding 8 more cows is indicated as having the same
effect on income, no matter how large the farm is or how many men are
employed. But that this conclusion has been reached is no proof that it
is really true of the universe represented by the original data. Instead,
such a conclusion is inherent in the linear equation (11.6, 11.7, or 11.8)
which has been used. That equation necessarily assumes that an increase
of one unit in any one independent variable will always be accompanied
by an equal change in the dependent variable. Only insofar as the actual
facts agree with that assumption can they be represented by a linear
equation. Subsequent chapters (particularly 14, 16, and 21) take up
methods of analysis which may be employed when this type of relation
is not true, and the linear equation is therefore unable to express the facts
adequately.
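The entries of Table 11.5, and the constant increments which the linear form forces upon them, can be reproduced from the rounded regression equation. The sketch below is an illustration; the function name `estimated_income` is introduced here for convenience.

```python
def estimated_income(acres, cows, men):
    """Rounded regression equation from the text, in original (uncoded) units."""
    return 360.60 + 1.21 * acres + 26.27 * cows + 50.30 * men


# A few cells of Table 11.5.
for acres, cows, men in [(100, 0, 1), (100, 8, 2), (160, 8, 2), (160, 16, 3)]:
    dollars = round(estimated_income(acres, cows, men))
    print(f"{acres} acres, {cows} cows, {men} men: {dollars} dollars")

# The effect of one more man is the same on every size of farm and herd.
gain_small = estimated_income(100, 8, 2) - estimated_income(100, 8, 1)
gain_large = estimated_income(160, 16, 3) - estimated_income(160, 16, 2)
print(round(gain_small, 2), round(gain_large, 2))
```

Both differences equal the coefficient for men, which is simply a restatement of the linear assumption and not an independent finding about farms.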

Net regression coefficients, computed from a sample, may vary more 
or less widely from the true values for the universe from which that sample 
is drawn. Tests to indicate the reliability of such sample results are given 
in Chapter 17. They should always be calculated and considered before 
generalizing from such sample results. 

Use of Card Tabulators or Electronic Computers to Perform the
Operations. If the samples are large, calculating the extensions, as
shown in Tables 11.1 and 11.3, involves a good deal of hand computing
even with a calculating machine. Card tabulators can speed up this
process, especially if they are equipped with automatic multiplying devices.†
In the latter case, after the equations have been solved, the tabulators can
also calculate the estimated values of $X_1'$ for each observation, and compute
the residuals $z$, as shown in Table 11.2. Electronic calculators can perform
all these operations, and also can solve the equations for any number of
variables within the scope of the machines, once a suitable set of
instructions to the machines has been developed.‡

† D. H. W. Allan and R. F. Attridge, The application of an IBM calculating punch
to solve multiple regression problems, Proceedings Seventh Annual Conference of the
American Society for Quality Control, pp. 521-33, 1953.

‡ J. A. C. Brown, H. S. Houthakker, and S. J. Prais, Electronic computation in
economic statistics, Journal of the American Statistical Association, Vol. 48, pp. 414-28,
September, 1953.

K. D. Tocher, Application of automatic computers to statistics, Automatic Digital
Computer, pp. 166-78, Her Majesty's Stationery Office, London, 1954.

Gordon Spencer, Statistics and automatic computers, Computers and Automation,
Vol. 4, No. 1, pp. 6-7, January, 1955.

D. A. Quarles, Jr., Operating notes, D F EOd (program designed to handle multiple
regression and correlation analysis problems on machine 701), IBM N.Y. Data
Processing Center, 701 Program Library, August, 1955.

---, Description of the printed output of the IBM 701 Multiple Regression and
Correlation Analysis Program, E 03, IBM N.Y. Data Processing Center, 701 Program
Library, August, 1955.

S. J. Prais and H. S. Houthakker, The analysis of family budgets, DAE Monograph
4, Cambridge University Press, 1955.

J. Aitchison and J. A. C. Brown, The lognormal distribution, DAE Monograph 5,
Chapter 13, Cambridge University Press, 1957.

F. S. Beckman and D. A. Quarles, Jr., Multiple regression and correlation analysis on
the IBM type 701 and type 704 electronic data processing machines, The American
Statistician, Vol. 10, No. 1, pp. 6-9, February, 1956.

---, Multiple Regression and Correlation Analysis, IBM N.Y. Scientific Computing
Center, NY MR 1, I/II, and Machine Operating Notes, 1/7, November, 1956.

---, Multiple Regression and Correlation Analysis, IBM SBC N.Y. Data
Processing Center, 704 Program Library, NY MR 2, April, 1957.

E. J. Laurie and R. W. Heald, Data Processing Bibliography, San Jose State College,
1958.

Malcolm R. Fisher, A sector model: the poultry industry of the U.S., Econometrica,
Vol. 26, No. 1, pp. 37-66, January, 1958. (This is a large-scale multiple regression
investigation, using electronic computation.)

Since many research workers will not have such equipment available
to them, full suggestions on ways to control errors and simplify the
procedures in doing the work by hand are given in Appendix 2, and are
summarized in Chapter 13 for one of the most efficient computing forms.
These checking devices can also be used with mechanical or electronic
equipment, to insure against any false functioning.

The availability of automatic or electronic equipment may lead the
investigator to try out a large number of different combinations of
independent variables, or to use solutions involving a very large number
of variables simultaneously, in order to obtain the best fit, without regard
to the logic of the equations employed. This may mislead him into
obtaining results of little reliability (see page 436), or into making analyses
without really studying and understanding the various series with which
he is dealing. This temptation is thus a drawback to the ease of computation
provided by modern computing machines.

Summary

This chapter has presented mathematical methods for determining the
constants of a multiple linear regression equation, so that changes in
one variable may be estimated from changes in two or more independent
variables. Equations so determined afford a more exact basis for making
such estimates than do linear equations obtained by any other method.
Furthermore, the multiple regression equation serves to sum up all the
evidence of a large number of observations in a single statement which
expresses in condensed form the extent to which differences in the
dependent variable tend to be associated with differences in each of the
other variables.

REFERENCES 

Tolley, H. R., and Mordecai Ezekiel, A method of handling multiple correlation
problems, Quart. Pub. Amer. Stat. Assoc., pp. 994-1003, No. 144, Vol. XVIII,
December, 1923.

Ezekiel, Mordecai, The assumptions implied in the multiple regression equation,
Jour. Amer. Stat. Assoc., No. 151, Vol. XX, pp. 405-8, September, 1925.

Wallace, H. A., and George W. Snedecor, Correlation and machine calculation, Iowa
State College Bul. 35, 1925.

Friedman, Joan, and Richard J. Foote, Computational methods for handling systems
of simultaneous equations, U.S. Dept. of Agr., Agr. Mktg. Serv., Agriculture
Handbook No. 94, 109 pp., November, 1955.

Williams, E. J., Regression Analysis, John Wiley & Sons, Inc., 1959.

Acton, Forman S., Analysis of Straight-Line Data, John Wiley & Sons, Inc., 1959.



CHAPTER 12 


Measuring accuracy of estimate 
and degree of correlation for 
multiple linear regressions 


Standard Error of Estimate. After working out equations by which
values of one variable may be estimated from those for two or more
independent variables, it is frequently desirable to have some measure
of how closely such estimates agree with the actual values and of how
closely the variation in the dependent variable is associated with the
variation in the several independent variables. Attention has been called
in the preceding chapters to the computation of the residuals, $z$, when
the value of a variable is estimated from that of several others. Where
the estimate is based on several independent variables, the standard
deviation of these residuals serves as a measure of the closeness with
which the original values may be estimated or reproduced, just as well as
where the estimate is based on a single variable. Continuing the same
terminology as before, this standard deviation is still called the "standard
error of estimate." Thus for the regression equation for estimating income
from known numbers of acres, cows, and men, the standard error of
estimate is designated $S_{1.234}$. The subscripts $1.234$ indicate that this is the
standard error for variable $X_1$ when estimated from the independent
variables $X_2$, $X_3$, and $X_4$.

Where the size of the sample is small in proportion to the number of
variables involved, the standard deviation of the residuals for the cases
included in the sample tends to have a downward bias. That is, it tends
to be smaller than the standard error which would be observed if the
same constant were computed from large samples drawn from the same
universe.

For that reason it is necessary to adjust the square of the observed
standard deviation of the residuals, $s_z^2$, before it will give an unbiased
estimate of the square of the value of the standard error of estimate for
estimates made for new observations drawn from the same universe.
This adjustment is:

$$\bar{S}_{1.234}^2 = \frac{n\,s_z^2}{n - m} \tag{12.1}$$

where n = number of sets of observations in the sample,
      m = number of constants in the regression equation, including
          a and the b's.

(Where the adjusted value for $\bar{S}_{1.234}$ exceeds the value of $s_1$, the latter
value should be used for the standard error.)

The standard errors for the equations obtained when one, two, and three
independent variables were considered in the farm-income study in
Chapter 11 may be summarized as follows:

$$
\begin{array}{lcccl}
\text{Independent Variables} & \text{Observed } s_z & n & m & \text{Adjusted Standard Error} \\ \hline
X_2 & 165.15^{*} & 20 & 2 & \bar{S}_{1.2} = 165.15 \\
X_2,\ X_3 & 70.48 & 20 & 3 & \bar{S}_{1.23} = 76.45 \\
X_2,\ X_3,\ X_4 & 66.77 & 20 & 4 & \bar{S}_{1.234} = 74.65
\end{array}
$$

* This value has not been shown previously. It is calculated from the data of
Chapter 11.

(In this case the correlation between $X_1$ and $X_2$ is practically zero, so
$s_z = s_1$. Under the rule given above, $\bar{S}_{1.2} = s_1$.) The values tabulated in
the last column illustrate the increase in the reliability of estimate as
additional variables are taken into account.
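The adjusted values in the last column can be checked with a few lines of code. The sketch below assumes that the adjustment multiplies the squared observed standard deviation of the residuals by n/(n - m), an assumption which reproduces the tabulated figures; the function name and the optional cap at $s_1$ are illustrative.

```python
import math


def adjusted_standard_error(observed_sz, n, m, s1=None):
    """Adjust the observed standard deviation of residuals for sample size.

    Assumes the adjustment S-bar^2 = n * s_z^2 / (n - m).  If s1 (the standard
    deviation of the dependent variable) is given, the result is capped at s1,
    following the rule stated in the text.
    """
    adjusted = math.sqrt(observed_sz ** 2 * n / (n - m))
    if s1 is not None and adjusted > s1:
        return s1
    return adjusted


s1 = 165.15  # standard deviation of income, from the text
print(round(adjusted_standard_error(165.15, 20, 2, s1=s1), 2))
print(round(adjusted_standard_error(70.48, 20, 3, s1=s1), 2))
print(round(adjusted_standard_error(66.77, 20, 4, s1=s1), 2))
```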

So far, the standard errors of estimate (except for simple or two-variable
regression) have been determined by actually working out all
the estimated values, subtracting to get the individual residuals, $z$, and
then determining their standard deviation. For linear multiple regression
equations, however, a much simpler process can be used. To compute
the standard deviation of the residuals by this process, all that is required
in addition to the values which have been used in computing the $b$'s
is the value, $\Sigma(x_1^2)$. The formula is as follows:

$$
\bar{S}_{1.234\ldots m}^2 = \frac{\Sigma(x_1^2) - \left[\,b_{12.34\ldots m}\,\Sigma(x_1x_2) + b_{13.24\ldots m}\,\Sigma(x_1x_3) + \cdots + b_{1m.23\ldots(m-1)}\,\Sigma(x_1x_m)\right]}{n - m}
\tag{12.2}
$$




Substituting the values for the regression equation computed with two
independent variables, pp. 172 and 174, the equation becomes

$$
\bar{S}_{1.23}^2 = \frac{\Sigma(x_1^2) - \left[\,b_{12.3}\,\Sigma(x_1x_2) + b_{13.2}\,\Sigma(x_1x_3)\right]}{n - m}
$$

In terms of coded values for $X_1$,

$$
\frac{\bar{S}_{1.23}^2}{10^2} = \frac{5{,}455.20 - (2.1384)(14.20) - (3.2569)(1{,}360.60)}{20 - 3}
$$

$$
\frac{\bar{S}_{1.23}}{10} = \sqrt{\frac{993.50}{17}} = 7.645
$$

$$
\bar{S}_{1.23} = 76.45
$$

The result is seen to be identical with the value computed (after adjustment)
by the lengthy process illustrated in Table 11.2, of working out all
the individual estimates, computing their standard deviation, and then
adjusting by equation (12.1).
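The short computation can be sketched as follows. The function name `short_standard_error` is a hypothetical helper, and the figures are the coded values quoted in the text, so the result has to be multiplied by 10 to return to dollars.

```python
import math


def short_standard_error(sum_x1_squared, coefficients, cross_sums, n, m):
    """Adjusted standard error of estimate from sums already on hand:
    sqrt((sum(x1^2) - sum(b_k * sum(x1*x_k))) / (n - m))."""
    explained = sum(b * s for b, s in zip(coefficients, cross_sums))
    return math.sqrt((sum_x1_squared - explained) / (n - m))


coded = short_standard_error(
    sum_x1_squared=5455.20,
    coefficients=[2.1384, 3.2569],   # b12.3 and b13.2
    cross_sums=[14.20, 1360.60],     # sum(x1*x2) and sum(x1*x3)
    n=20,
    m=3,
)
decoded = 10 * coded  # income was coded by dividing by 10
print(round(coded, 3), round(decoded, 2))
```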

Multiple Correlation. The standard error of estimate for a multiple
regression equation, just as with simple regression, measures the closeness
with which the estimated values agree with the original values. The
standard error, however, offers no measure of the proportion of the
variation in the dependent factor which can be explained by, or is associated
with, variation in the independent factor or factors. For example, in
one area the farm income might be twice as variable as in another. If
two or three independent factors such as those discussed came as near
accounting for all the variation in incomes in one area as in the other,
the standard errors of estimate would be the same in both cases. There
was originally more variance in income in the one case than in the other;
therefore, with the same amount left unaccounted for, the independent
factors would have been associated with a larger proportion of the original
variance in the case where it was largest to begin with, and would have
been relatively more important in that case. In simple regression, the
relative importance of the independent factors was measured by the ratio
of the standard deviation of the estimated values to the standard deviation
of the actual values, and the name coefficient of correlation was given
to this ratio. In exactly similar manner, when the estimates are based
on several variables, instead of on one, the relative importance of all
those variables combined may be measured by dividing the standard
deviation of the estimated values by that of the original values. This
ratio is named the coefficient of multiple correlation, since it measures
the combined importance of the several independent factors as a means of
explaining the differences in the dependent factor. It is a useful measure
when the sample is constructed on the "correlation model" (see Chapter 17).

Using $X_1'$ to designate the estimates of $X_1$ made from variables $X_2$, $X_3$,
and $X_4$, and $R_{1.234}$ to represent the unadjusted coefficient of multiple
correlation, the coefficient may be defined:

$$X_1' = a_{1.234} + b_{12.34}X_2 + b_{13.24}X_3 + b_{14.23}X_4 \tag{12.3}$$

$$R_{1.234} = \frac{s_{X_1'}}{s_1} \tag{12.4}$$

where $s_{X_1'}$ and $s_1$ are the standard deviations of the estimated and of the
original values of $X_1$.

The same short formula which has been shown for computing the
standard error of estimate may be employed to facilitate the computation
of the coefficient of multiple correlation, using only values already involved
in equation (12.2). The equation for computing the coefficient of
correlation by this method is:¹

$$
R_{1.234\ldots m}^2 = \frac{b_{12.34\ldots m}\,\Sigma(x_1x_2) + b_{13.24\ldots m}\,\Sigma(x_1x_3) + \cdots + b_{1m.23\ldots(m-1)}\,\Sigma(x_1x_m)}{\Sigma(x_1^2)}
\tag{12.5}
$$

The square of the coefficient of multiple correlation, $R^2$, may be termed
the coefficient of multiple determination.

The same relations hold between the coefficient of multiple correlation
and the unadjusted standard error of estimate in the case of multiple
correlation as in the case of simple correlation. For that reason, one of
these measures may be computed from the other, whichever is determined
first, according to the following equations, in which $s_1$ is the standard
deviation of $X_1$:

$$R_{1.234\ldots m}^2 = 1 - \frac{S_{1.234\ldots m}^2}{s_1^2} \tag{12.6}$$

$$S_{1.234\ldots m}^2 = s_1^2\,(1 - R_{1.234\ldots m}^2) \tag{12.7}$$


Using equation (12.6) to compute the values of $R$ from the values of $S$,
the multiple coefficients for the three regression equations previously
worked out may be stated in the following different ways:

Dependent     Independent Variable(s)     Standard     Coefficient     Coefficient
Variable,                                 Error of     of Multiple     of Multiple
Income        Acres    Cows    Men        Estimate     Correlation     Determination

X1            X2                           165.15         0.008*          0
X1            X2       X3                   70.48         0.904           0.818
X1            X2       X3      X4           66.77         0.915           0.837

* Simple correlation coefficient, r12.
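The last two columns of this summary follow from the standard errors alone. The sketch below assumes that the relation between the two measures is $R^2 = 1 - S^2/s_1^2$, with $s_1 = 165.15$ as the standard deviation of income; the function name `multiple_correlation` is a hypothetical helper.

```python
import math


def multiple_correlation(standard_error, s1):
    """Return (R, R^2) from the unadjusted standard error of estimate,
    assuming R^2 = 1 - S^2 / s1^2."""
    r_squared = max(1 - (standard_error / s1) ** 2, 0.0)
    return math.sqrt(r_squared), r_squared


s1 = 165.15  # standard deviation of the dependent variable, income
for label, se in [("acres, cows", 70.48), ("acres, cows, men", 66.77)]:
    r, r2 = multiple_correlation(se, s1)
    print(f"{label}: R = {r:.3f}, R^2 = {r2:.3f}")
```

For acres alone the standard error equals $s_1$ to two decimals, so this route gives a coefficient of zero; the 0.008 in the table is the simple correlation computed directly.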





¹ This may be computed conveniently by following the form shown on pp. 496 or 512.




It is evident that the correlation increases as the standard error decreases.
Here the residual variation in each case is being compared with the same
original standard deviation, so that that necessarily follows. Where
different studies are being compared, however, such as two samples with
widely different original deviations in the dependent variable, the standard
error of estimate would not necessarily decrease as the correlation
increased, since the former is an absolute measure whereas the latter is a
relative measure.

If such a statement is to be made as "75 per cent of the variance in
income was associated with (or related to) variances in numbers of acres
farmed, of cows milked, and men hired," it is more accurate to use the
coefficient of multiple determination than to use the coefficient of multiple
correlation. The latter would overstate the case. This principle holds
true both for simple correlation ($r$) and multiple correlation ($R$): the
square of the coefficient indicates the proportion of the variance in the
dependent variable which has been mathematically accounted for,
whereas 1 minus the square of the coefficient indicates the proportion which
has not been accounted for.²

The coefficient of multiple correlation, $R_{1.234}$, may also be defined
as the simple correlation between the actual $X_1$ values and the $X_1'$ values
estimated from the several independent factors.
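This alternative definition can be tried on the farm data. The sketch below is an illustration: it forms the estimates from the coded regression equation for acres, cows, and men quoted earlier in the text, and then takes the ordinary simple correlation between actual and estimated income. The helper names are introduced here for convenience.

```python
import math

# Coded farm data (acres and income divided by 10).
acres  = [6, 22, 18, 8, 12, 10, 17, 11, 16, 23, 7, 12, 24, 16, 9, 11, 22, 11, 16, 8]
cows   = [18, 0, 14, 6, 1, 9, 6, 12, 7, 2, 17, 15, 7, 0, 12, 16, 2, 6, 12, 15]
men    = [2, 3, 4, 1, 1, 1, 3, 2, 2, 3, 2, 3, 4, 2, 1, 3, 2, 1, 2, 2]
income = [96, 83, 126, 61, 59, 90, 82, 88, 86, 76, 102, 108, 96, 70, 80, 113, 76, 74, 98, 80]


def estimate(x2, x3, x4):
    """Coded regression equation for income on acres, cows, and men."""
    return 36.06 + 1.20584 * x2 + 2.62737 * x3 + 5.02978 * x4


def simple_correlation(u, v):
    """Ordinary product-moment correlation between two series."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    s_uv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    s_uu = sum((a - mean_u) ** 2 for a in u)
    s_vv = sum((b - mean_v) ** 2 for b in v)
    return s_uv / math.sqrt(s_uu * s_vv)


estimates = [estimate(a, c, m) for a, c, m in zip(acres, cows, men)]
r_actual_vs_estimated = simple_correlation(income, estimates)
print(round(r_actual_vs_estimated, 3))
```

The result should come out close to the coefficient of multiple correlation of about 0.915 found from the standard error.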

For convenient methods of calculating the various measures discussed
in this chapter, see Appendix 2, pages 489-520.

Where a parabola or other algebraic equation has been fitted as a
regression curve by least squares, the index of correlation may be computed
by equation (12.5), and the standard error of estimate by equation (12.2)
or (12.7), using the several terms in the fitted equation, $X$, $X^2$, etc., in
place of $X_2$, $X_3$, etc.

Measuring the Separate Effect of Individual Variables. In addition
to the measures of the importance of all of the independent variables
combined, it is sometimes desirable to have measures of the importance
of each of the individual variables taken separately, while simultaneously
allowing for the variation associated with remaining independent variables.
There are two different types of these measures: the coefficient of partial
correlation and the beta coefficient.

Partial Correlation. Coefficients of partial correlation measure
the correlation between the dependent factor and each of the several
independent factors, while eliminating any (linear) tendency of the
remaining independent factors to obscure the relation. Thus in the problem
where income was correlated with numbers of acres, cows, and men,
the partial correlation of income with acres, while holding constant cows

² See Note 2, Appendix 3.



Accuracy of Estimate and Degree of Correlation 193

and men, indicates what the average correlation would probably be
between acres and income in samples of farms in which all the farms in
each sample had the same number of cows and the same number of men.

If an average of this series of correlations was calculated,³ it would
correspond to the partial correlation of income with acres, while holding
cows and men constant ($r_{12.34}$). A similar interpretation can be made for
the other two partial correlation coefficients. Even in problems (such as
the present one) where the number of observations is not sufficient to
permit of many such subgroups being formed, the partial correlation
coefficient indicates about what such an average correlation in selected
subgroups would be, if computed from a larger sample drawn from the
same universe.

Any group of independent variables may serve to explain some, but
not all, of the variation in a dependent variable. If an additional
independent variable is added, it may also account for part of the variation
left unexplained by the factors previously considered. The coefficient
of partial correlation may be defined as a measure of the extent to which
that part of the variation in the dependent variable which was not explained
by the other independent factors can be explained by the addition of the
new factor. For example, in the farm-income problem, considering only
acres and cows, the correlation was $R_{1.23} = 0.904$. When acres, cows, and
men were considered, the correlation was $R_{1.234} = 0.915$. Squaring both
values shows that, whereas the two variables explain 81.7 per cent of the
variance in income, the three variables explain 83.7 per cent. Whereas
18.3 per cent of the variance is left to be explained when the two variables
are considered, only 16.3 per cent is left to be explained when three are
considered. Adding the additional variable has increased the variance
which can be explained by the difference between these two figures, or
2.0 per cent (18.3 - 16.3). If the importance of this increase is determined
by comparing it with the variance left unexplained before the new variable
was added, we find that 2.0/18.3, or 10.93 per cent of the variance
left unexplained by acres and cows, has now been found to have been
associated with differences in numbers of men. Taking its square root
gives the coefficient of partial correlation, 0.33.
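
The arithmetic of this paragraph can be sketched in a few lines. The function name and argument names below are illustrative assumptions; the two correlations, 0.904 and 0.915, are the values quoted in the text.

```python
import math

def partial_r_squared(R_with, R_without):
    """Share of the previously unexplained variance that is removed
    when one more independent variable is added."""
    unexplained_before = 1.0 - R_without ** 2   # about 18.3 per cent here
    unexplained_after = 1.0 - R_with ** 2       # about 16.3 per cent here
    return (unexplained_before - unexplained_after) / unexplained_before

# Acres and cows alone give R = 0.904; adding men gives R = 0.915.
r2_men = partial_r_squared(R_with=0.915, R_without=0.904)
r_men = math.sqrt(r2_men)

print(round(r2_men, 3), round(r_men, 2))  # prints 0.109 0.33
```

Working from the unrounded squares gives 0.109 rather than the 0.1093 obtained from the rounded percentages, but the partial correlation is 0.33 either way.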

The coefficient is designated $r_{14.23}$ since it shows the partial correlation
between $X_1$ and $X_4$ after $X_2$ and $X_3$ have been taken into account. As is
indicated in the discussion, its square may be computed by the formula⁴

$$r_{14.23}^2 = \frac{(1 - R_{1.23}^2) - (1 - R_{1.234}^2)}{1 - R_{1.23}^2}$$

³ The calculation of the average of a series of correlation coefficients would involve
the use of Fisher's z-transformation.

⁴ This is different from the formula customarily given. See Note 3, Appendix 3.





For purposes of computation, this formula may be simplified to

$$r_{14.23}^2 = 1 - \frac{1 - R_{1.234}^2}{1 - R_{1.23}^2} \tag{12.8}$$

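If it is desired to compute coefficients of partial correlation for the
other independent variables, acres and cows, the corresponding formulas
are

$$r_{12.34}^2 = 1 - \frac{1 - R_{1.234}^2}{1 - R_{1.34}^2}, \qquad r_{13.24}^2 = 1 - \frac{1 - R_{1.234}^2}{1 - R_{1.24}^2}$$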



In each case, the coefficient should be given the same sign as the
corresponding net regression coefficient. That is, if $b_{13.24}$ is negative,
$r_{13.24}$ will also be negative.

It should be noticed that, although the numerator of the fraction is the
same in each case, the denominator is different. This is a peculiarity of
coefficients of partial correlation: they measure the importance of each
of the several variables by determining how much it reduces the variation
after all the other variables except it are taken into account.

If we work out the new multiple correlations necessary,⁵ $R_{1.24}$ and $R_{1.34}$,
and substitute them in the equations given just above, the whole set of
coefficients of partial correlation and partial determination for the
farm-income problem works out as in the following equations and in Table 12.1.


$$r_{13.24}^2 = 1 - \frac{1 - 0.837}{1 - 0.633} = 0.556$$

$$r_{12.34}^2 = 1 - \frac{1 - 0.837}{1 - 0.804} = 0.168$$


⁵ The two new coefficients of multiple correlation are obtained by rearranging the
arithmetic values previously computed so as to give the necessary regression coefficients,
and then determining the value of $R$ by equation (12.5). The two new sets of equations
are:

To determine $R_{1.34}$,

$$(\Sigma x_3^2)b_{13.4} + (\Sigma x_3x_4)b_{14.3} = (\Sigma x_1x_3)$$
$$(\Sigma x_3x_4)b_{13.4} + (\Sigma x_4^2)b_{14.3} = (\Sigma x_1x_4)$$

Similarly, for $R_{1.24}$,

$$(\Sigma x_2^2)b_{12.4} + (\Sigma x_2x_4)b_{14.2} = (\Sigma x_1x_2)$$
$$(\Sigma x_2x_4)b_{12.4} + (\Sigma x_4^2)b_{14.2} = (\Sigma x_1x_4)$$

For a method of computation which facilitates the calculation of the partial r's, as
well as of the earlier values, see Chapter 13, and Appendix 2, pages 503-506.




When income was correlated with acres alone, there was virtually no
correlation ($r_{12} = 0.01$). Yet the partial correlation of income with acres,
while holding constant the variation associated with cows and men, has
just been seen to be 0.41. Although this is not high, it is certainly more than
no correlation. Furthermore, even though the correlation of income with
cows alone is 0.71, the correlation with both acres and cows is 0.90.


Table 12.1

Relative Importance of Individual Factors Affecting Income,
as Indicated by Coefficients of Partial Correlation

                                             Coefficient of     Reduction in
Factors Already                              Partial            Unexplained
Considered                  Factor Added     Correlation,       Variance,
                                             $r_{12.34}$, etc.  $r_{12.34}^2$, etc.

Cows ($X_3$), men ($X_4$)   Acres ($X_2$)    0.41               0.168
Acres ($X_2$), men ($X_4$)  Cows ($X_3$)     0.75               0.556
Acres ($X_2$), cows ($X_3$) Men ($X_4$)      0.33               0.109


On the surface of the data there appears to be no relation between
acres and income, since the positive relation of acres to income is hidden.
Acres are negatively correlated with cows to a sufficient extent so that
the decreased income with decreased number of cows offsets the increases
with more acres. Only when the number of cows is allowed for can the
influence of acres be seen.

It is evident that a mere surface examination of a set of data cannot
reveal which independent factors are important and which are unimportant.
A variable which shows no correlation with the dependent variable may
yet show significant correlation after the relation to other variables has
been allowed for.
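
A small constructed example may make the point concrete. The eight observations below are hypothetical and chosen only so that the arithmetic is exact (they are not the farm data); the names `acres`, `cows`, and `income` are borrowed from the text purely as labels. The partial correlation is obtained here by correlating the residuals left after each variable is regressed on the variable held constant, which is one standard way of computing it and is not the computing form used in this book.

```python
import numpy as np

# Hypothetical observations, expressed as departures from their means.
acres = np.array([-1, -1, 1, 1, -1, -1, 1, 1], dtype=float)
other = np.array([-1, 1, -1, 1, -1, 1, -1, 1], dtype=float)
noise = 0.5 * np.array([1, 1, 1, 1, -1, -1, -1, -1], dtype=float)

cows = -acres + other            # cows fall as acres rise
income = acres + cows + noise    # both acres and cows add to income

def residuals(y, x):
    """Residuals of y after a simple least-squares regression on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (intercept + slope * x)

# Simple correlation of income with acres: the effect is hidden.
simple_r = np.corrcoef(income, acres)[0, 1]

# Partial correlation of income with acres, holding cows constant.
partial_r = np.corrcoef(residuals(income, cows), residuals(acres, cows))[0, 1]

print(round(simple_r, 3), round(partial_r, 3))
```

In this constructed case the simple correlation is zero while the partial correlation is about 0.82, the same pattern, in exaggerated form, as that found for acres in the farm-income problem.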

Investigators sometimes think they are doing “research” when they
study the relation of a given variable, say the price of a commodity, to
a number of other factors, discard all those factors that show no correlation
with price, and select for further study by multiple correlation the factors
that show the highest simple correlation with the price. As the preceding
discussion shows, that procedure may result in discarding factors which
would show a truly important relation to price after the effect of other
associated factors had been allowed for. A careful, logical examination
of the problem, the selection of the factors to be considered on the basis
of these qualitative considerations, and then preliminary examination of
all the intercorrelations among the selected independent factors, will
provide more trustworthy results. (See Chapter 26 for a more detailed
discussion of the places of qualitative and quantitative analysis in such
studies.)

The test whether a given independent variable may really be related to
the dependent variable, even if it shows no apparent correlation, is whether
that independent variable is correlated with other independent variables,
which in turn are correlated with the dependent. Thus in the example
just discussed, although acres showed no correlation with income, they
did show significant correlation with cows. If acres had had no correlation
with either income, cows, or men, it would have been impossible for
acres to have correlation with income even after the relation to cows and
men was allowed for.

Beta Coefficients. The importance of individual variables may also
be compared by their net regression coefficients. The size of the regression
coefficients, however, varies with the units in which each variable is stated.
They may be made more comparable by expressing each variable in terms
of its own standard deviation, using the beta coefficients mentioned in
Chapter 9. In terms of betas, the regression equation for four variables
would be

$$\frac{X_1}{s_1} = \beta_{12.34}\frac{X_2}{s_2} + \beta_{13.24}\frac{X_3}{s_3} + \beta_{14.23}\frac{X_4}{s_4} + a$$

Hence the partial betas may be defined

$$\beta_{12.34} = b_{12.34}\frac{s_2}{s_1} \tag{12.9}$$

For the problem we have been considering, the betas may be calculated
very readily:

$$\beta_{12.34} = b_{12.34}\frac{s_2}{s_1} = 1.2058\,\frac{s_2}{s_1} = 0.402$$

$$\beta_{13.24} = b_{13.24}\frac{s_3}{s_1} = 2.6274\,\frac{s_3}{s_1} = 0.926$$

$$\beta_{14.23} = b_{14.23}\frac{s_4}{s_1} = 5.0298\,\frac{s_4}{s_1} = 0.282$$

If the relative importance of each of the different factors, as judged
by the two different types of individual measurement, is compared, the
relations are as shown in Table 12.2.





Table 12.2

Relative Importance of Individual Factors Affecting Income,
as Indicated by Two Different Coefficients

                                              Coefficients
Independent                                   of Partial      Beta
Factor          Factors Held Constant         Correlation     Coefficients

Acres ($X_2$)   Cows ($X_3$), men ($X_4$)     0.41            0.402
Cows ($X_3$)    Acres ($X_2$), men ($X_4$)    0.75            0.926
Men ($X_4$)     Acres ($X_2$), cows ($X_3$)   0.33            0.282


It is evident from this comparison that although the exact values differ
for the two sets of measures, the rank of the three variables in order of
importance is the same and the relative sizes are comparable. This does
not always hold true, owing to the mathematical differences in the meaning
of the two sets.

Reliability of Results from a Sample. All the coefficients presented
in this chapter are subject to fluctuations of sampling just as are simpler
coefficients. Chapter 17 discusses the extent of these fluctuations with
various sizes of samples and gives methods of estimating confidence
intervals when the sample statistics are used to draw inferences as to the
probable values of the parameters in the universe. The comments made
previously about the effects of purposeful selection of values of the
independent variables upon simple and multiple coefficients of correlation
also apply to coefficients of partial correlation and to beta coefficients.
If only extremely low and extremely high values of $X_2$ are selected for
inclusion in the sample, the coefficients $r_{12.34}$ and $\beta_{12.34}$ will tend to be high;
if the values of $X_2$ are confined to a narrow range, they will tend to be low.
A similar, though less obvious, situation may arise when no special efforts
are made to select values of $X_2$, but when it happens that $X_2$ is correlated
with another independent variable the values of which are purposefully
selected. Thus, if the partial correlation coefficient is to be regarded as an
estimate of a universe parameter, it is important to keep clearly in mind the
nature of the universe to which the inferences are expected to apply, as
related to the universe represented in the sample.

If the conditions of the sample are such that sample values can be used
as bases for estimating the parameters in the universe (i.e., if the sample is
of the “correlation model” type), the adjustments for degrees of freedom
explained in Chapter 17, as well as the error formulas, may become
important.


Summary 


This chapter has shown that the accuracy of a regression equation for
estimating one variable from two or more others may be measured by the
standard error of estimate. The extent to which variation in the dependent
variable is associated with the variation in the several independent variables
may be measured by the coefficient of multiple correlation, or, with respect
to variance, by the coefficient of multiple determination. The closeness
of association between the dependent and each of the independent
variables after the effects of the other variables have first been removed
may (1) be measured by the coefficient of partial correlation, or (2) be
indicated by the beta coefficients, which reduce the net regression
coefficients to a comparable basis; but both are of limited value except in the
correlation model, as discussed in Chapter 17.

NOTE. In computing the index of correlation where a regression curve
has been determined by using a transformed value of the dependent
variable (as with the use of log Y instead of Y on pages 93-96), the index
of correlation must be computed from the transformed values instead of
the natural values. That is,



Sy 

~s 


where a « F — F" 



CHAPTER 13 


Practical methods for working
multivariable linear correlation
and regression problems


Here we will take a new illustration of a multivariable correlation
problem, and work through the calculations of the net regression
coefficients, the coefficients of multiple and partial correlation, and the
standard errors for these statistics. The meaning of such standard
errors is discussed in more detail in Chapters 17 and 19.

For this exercise, we will use data on the per capita consumption of beef
in the United States, and on three variables logically related to it: retail
price of beef, deflated for changes in price levels; disposable income per
capita, similarly deflated; and the consumption of pork per capita. The
hypothesis assumed in selecting these data is that consumption of beef
per person will be affected by real prices of beef, real income available for
expenditure, and consumption of the other main meat, pork. The figures,
thus adjusted, are given in the first five columns of Table 13.1.

The final column in the table, headed check sum, is calculated by
making a total of all the other values for each observation, from $X_1$
through $X_4$.

The next step is to add all the columns, including the $S_0$, and to divide
by n, the number of observations (20 in this case), and enter the average.
The results are shown at the foot of the table. All the calculations to this
point are now verified by the two equations:

$$\Sigma X_1 + \Sigma X_2 + \Sigma X_3 + \Sigma X_4 = \Sigma(S_0) \tag{13.1}$$

and

$$M_1 + M_2 + M_3 + M_4 = M_0 \tag{13.2}$$

That is, in each of these last two lines, the sum of the entries in the four
columns from $X_1$ to $X_4$ will equal the entry in the $S_0$ column.
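
The checks in equations (13.1) and (13.2) can be sketched as follows. The figures are those of Table 13.1 as read here, with the 1935 income entry taken as 468 and the 1941 check sum as 822.3, the readings under which the printed column totals balance; the layout of the `rows` list and the tolerance are assumptions of this sketch.

```python
# (year, X1 beef lb, X2 beef price cents, X3 income dollars, X4 pork lb, S0 check sum)
rows = [
    (1922, 59.1, 23.1, 452, 65.7, 599.9), (1923, 59.6, 23.6, 505, 74.2, 662.4),
    (1924, 59.5, 24.1, 499, 74.0, 656.6), (1925, 59.5, 24.5, 507, 66.8, 657.8),
    (1926, 60.3, 24.8, 515, 64.1, 664.2), (1927, 54.5, 26.5, 520, 67.7, 668.7),
    (1928, 48.7, 30.5, 533, 70.9, 683.1), (1929, 49.7, 32.0, 556, 69.6, 707.3),
    (1930, 48.9, 30.3, 506, 67.0, 652.2), (1931, 48.6, 27.6, 474, 68.4, 618.6),
    (1932, 46.7, 25.5, 400, 70.7, 542.9), (1933, 51.5, 23.3, 394, 69.6, 538.4),
    (1934, 55.9, 24.4, 430, 63.1, 573.4), (1935, 52.9, 31.1, 468, 48.4, 600.4),
    (1936, 58.1, 28.9, 522, 55.1, 664.1), (1937, 55.2, 31.7, 537, 55.8, 679.7),
    (1938, 54.4, 28.5, 502, 58.2, 643.1), (1939, 54.7, 29.7, 542, 64.7, 691.1),
    (1940, 54.9, 29.5, 575, 73.5, 732.9), (1941, 60.9, 30.0, 663, 68.4, 822.3),
]

TOL = 1e-6  # allowance for floating-point rounding only

# Row check: each check sum must equal X1 + X2 + X3 + X4 for that year.
bad_years = [year for year, x1, x2, x3, x4, s0 in rows
             if abs((x1 + x2 + x3 + x4) - s0) > TOL]

# Column totals and means for X1..X4 and S0.
n = len(rows)
totals = [sum(r[i] for r in rows) for i in range(1, 6)]
means = [t / n for t in totals]

# Equation (13.1): the four column totals add to the total of the check sums.
check_13_1 = abs(sum(totals[:4]) - totals[4]) < TOL
# Equation (13.2): the four means add to the mean of the check sums.
check_13_2 = abs(sum(means[:4]) - means[4]) < TOL

print(bad_years, check_13_1, check_13_2)
```

An empty list of failing years, together with both checks passing, corresponds to the verification described in the text.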




Table 13.1

Factors Related to Beef Consumption, U.S.A.

         Beef          Beef Price   Disposable    Pork
         Consumption   at Retail,   Income        Consumption
         per Capita,   Deflated,*   per Capita,   per Capita,   Check Sum,
Year     $X_1$         $X_2$        Deflated,*    $X_4$         $S_0$
                                    $X_3$
         Pounds        Cents        Dollars       Pounds

1922     59.1          23.1         452           65.7          599.9
1923     59.6          23.6         505           74.2          662.4
1924     59.5          24.1         499           74.0          656.6
1925     59.5          24.5         507           66.8          657.8
1926     60.3          24.8         515           64.1          664.2
1927     54.5          26.5         520           67.7          668.7
1928     48.7          30.5         533           70.9          683.1
1929     49.7          32.0         556           69.6          707.3
1930     48.9          30.3         506           67.0          652.2
1931     48.6          27.6         474           68.4          618.6
1932     46.7          25.5         400           70.7          542.9
1933†    51.5          23.3         394           69.6          538.4
1934†    55.9          24.4         430           63.1          573.4
1935†    52.9          31.1         468           48.4          600.4
1936†    58.1          28.9         522           55.1          664.1
1937     55.2          31.7         537           55.8          679.7
1938     54.4          28.5         502           58.2          643.1
1939     54.7          29.7         542           64.7          691.1
1940     54.9          29.5         575           73.5          732.9
1941     60.9          30.0         663           68.4          822.3

Sum      1093.6        549.6        10,100        1315.9        13,059.1
Means    54.680000     27.480000    505.0000      65.795000     652.955000

* Divided by Consumer Price Index, with 1935-39 = 1.00.

† Excludes quantities of beef diverted from normal market channels under emergency
Government programs.


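The next step is to make the extensions and sums for all the variables
and the check sum, as follows:

$$\begin{array}{lllll}
\Sigma(X_1^2) & \Sigma(X_1X_2) & \Sigma(X_1X_3) & \Sigma(X_1X_4) & \Sigma(X_1S_0) \\
 & \Sigma(X_2^2) & \Sigma(X_2X_3) & \Sigma(X_2X_4) & \Sigma(X_2S_0) \\
 & & \Sigma(X_3^2) & \Sigma(X_3X_4) & \Sigma(X_3S_0) \\
 & & & \Sigma(X_4^2) & \Sigma(X_4S_0)
\end{array}$$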




Practical Working Methods 201

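The accuracy of the extensions for each variable is controlled by the
identities

$$\Sigma(X_1^2) + \Sigma(X_1X_2) + \Sigma(X_1X_3) + \Sigma(X_1X_4) = \Sigma(X_1S_0) \tag{13.3}$$

$$\Sigma(X_1X_2) + \Sigma(X_2^2) + \Sigma(X_2X_3) + \Sigma(X_2X_4) = \Sigma(X_2S_0)$$

etc.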

The sums are then adjusted to departures from the means of each
variable, and the normal equations, together with certain additional
terms to provide the partial correlation coefficients and the standard
errors, are set up and solved. The method of computing the solutions and
the constants is given in detail in Appendix 2, pp. 507-520. This method,
based on matrix algebra, provides all the needed values more rapidly
than does the Doolittle method used elsewhere in this book. The new
method, too, can be learned and followed by any intelligent student or
statistical clerk, without advanced mathematical training.

The solution of these equations, as shown in Appendix 2, gives the 
following statistics: 

1. Regression equation (standard errors in parentheses below the coefficients):

$$X_1 = 90.814 - \underset{(0.146)}{1.850}\,X_2 + \underset{(0.0069)}{0.0832}\,X_3 - \underset{(0.054)}{0.415}\,X_4$$

2. Standard error of estimate:

$$S_{1.234} = 1.371$$

3. Coefficient of multiple correlation:

$$R_{1.234} = 0.980$$

4. Coefficients of partial correlation:

$$r_{12.34} = -0.954$$
$$r_{13.24} = 0.950$$
$$r_{14.23} = -0.887$$

And computing the standard deviations of the four variables, the $\beta$
coefficients can then be calculated as follows:

$$\beta_{12.34} = -1.271 \qquad \beta_{13.24} = 1.130 \qquad \beta_{14.23} = -0.631$$
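
The beta coefficients can be sketched directly from the definition, each net regression coefficient multiplied by the ratio of the standard deviation of its variable to that of the dependent variable. The columns below are those of Table 13.1 as read here (with the 1935 income entry taken as 468, the reading under which the printed income total balances), and the dictionary layout is an assumption of this sketch. Because only ratios of standard deviations enter, it makes no difference whether they are computed with n or n - 1.

```python
from statistics import pstdev

x1 = [59.1, 59.6, 59.5, 59.5, 60.3, 54.5, 48.7, 49.7, 48.9, 48.6,
      46.7, 51.5, 55.9, 52.9, 58.1, 55.2, 54.4, 54.7, 54.9, 60.9]  # beef, lb
x2 = [23.1, 23.6, 24.1, 24.5, 24.8, 26.5, 30.5, 32.0, 30.3, 27.6,
      25.5, 23.3, 24.4, 31.1, 28.9, 31.7, 28.5, 29.7, 29.5, 30.0]  # price, cents
x3 = [452, 505, 499, 507, 515, 520, 533, 556, 506, 474,
      400, 394, 430, 468, 522, 537, 502, 542, 575, 663]            # income, dollars
x4 = [65.7, 74.2, 74.0, 66.8, 64.1, 67.7, 70.9, 69.6, 67.0, 68.4,
      70.7, 69.6, 63.1, 48.4, 55.1, 55.8, 58.2, 64.7, 73.5, 68.4]  # pork, lb

# Net regression coefficients as printed in the regression equation above.
net_regression = {"X2": (-1.850, x2), "X3": (0.0832, x3), "X4": (-0.415, x4)}

s1 = pstdev(x1)  # standard deviation of the dependent variable
betas = {name: b * pstdev(column) / s1
         for name, (b, column) in net_regression.items()}

for name, beta in betas.items():
    print(name, round(beta, 3))
```

The results should agree with the printed betas to within the rounding of the regression coefficients.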

Interpretation of the Results. The regression equation shows that
during the 20-year period, beef consumption was significantly related to
all three variables: beef prices, income, and pork consumption. Since
the computed net regression coefficients are all more than 7 times their
own standard errors, the chance of getting values as large as this from
a universe in which $X_1$ is uncorrelated with the other variables is practically
zero, even if we take into account the fact that after allowing for the 4
constants in the equation, we have only 16 degrees of freedom to consider
in consulting tables such as Table 2.3. If ±2 times the standard error is
taken as a reasonable confidence interval, there is a high probability
(P = 0.93) that the coefficients of $X_2$ and $X_3$ are within 16 per cent, and that
of $X_4$ within 26 per cent, of the corresponding universe values. (See
Chapter 20 for a discussion of sample and universe relationships in time series.)

The regression coefficients as they stand indicate that per capita beef
consumption tended to decline 1.85 pounds for every increase of 1 cent
in beef prices, to rise 0.83 pound for every advance of 10 dollars in per
capita income, and to decline about 0.4 pound for every increase of 1
pound in pork consumption. (These increases of pork consumption
presumably were usually accompanied by corresponding decreases in
pork prices; since that was not considered separately, the effect here is
the net result of changes in both pork consumption and associated pork
prices.) Altogether, the standard error of estimate indicates that the
errors in estimating beef consumption from the three other factors, for
the period studied, had a standard deviation of only 1.37 pounds. For
estimating the average errors expected in using this equation to make
estimates for other observations drawn from the same universe, however,
we must compute $\bar{S}_{1.234}$. With n = 20 and m = 4, this comes out, by
equation (12.1), to 1.53 pounds, only slightly larger.
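
The adjustment for degrees of freedom can be sketched as below. Equation (12.1) is not reproduced in this section; the form used here, multiplying the unadjusted standard error by the square root of n/(n - m), is an assumption that is consistent with the figures quoted (1.371 becoming 1.53 with n = 20 and m = 4). The function name is illustrative.

```python
import math

def adjusted_standard_error(s, n, m):
    """Standard error of estimate adjusted for the m constants fitted
    from n observations (assumed form: s * sqrt(n / (n - m)))."""
    return s * math.sqrt(n / (n - m))

s_bar = adjusted_standard_error(1.371, n=20, m=4)
print(round(s_bar, 2))  # prints 1.53
```
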
To really measure the average elasticity of demand with respect to
beef price and income, we should have used logarithmic values for our
variables instead of arithmetic. We can, however, make a rough
approximation by estimating the response for changes at the mean values, as in
Table 13.2.


Table 13.2

                                                            Corresponding   Change Divided
                                                            Change in       by Mean Beef
Independent                  Mean      1% Change   Net          Consumption,    Consumption,
Variable                     Value     from Mean   Regression   (2) × (3)       (4) ÷ 54.68
                             (1)       (2)         (3)          (4)             (5)

Beef price, $X_2$            27.48     0.2748      -1.85        -0.5084         -0.930%
Disposable income, $X_3$     505       5.05         0.0832       0.4202          0.768%
Pork consumption, $X_4$      65.795    0.658       -0.415       -0.2731         -0.499%




At their means, the elasticity of demand for beef with respect to price
is thus -0.93, or almost unity; the elasticity with respect to income is
0.77, and the cross elasticity with respect to pork consumption is -0.50.
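
The steps of Table 13.2 reduce to multiplying each net regression coefficient by the ratio of the mean of its variable to the mean of beef consumption. A minimal sketch, using the means and coefficients quoted in the text (the dictionary keys are illustrative labels only):

```python
mean_beef = 54.68  # mean per capita beef consumption, pounds

# name: (mean value, net regression coefficient)
variables = {
    "beef price": (27.48, -1.850),
    "income": (505.0, 0.0832),
    "pork consumption": (65.795, -0.415),
}

# Per cent change in beef consumption for a 1 per cent change at the mean.
elasticities = {name: b * mean / mean_beef
                for name, (mean, b) in variables.items()}

for name, e in elasticities.items():
    print(name, round(e, 2))
```

This gives -0.93, 0.77, and -0.5, the figures of the paragraph above.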

(For an exercise in the new method of computation, students may wish
to rework this problem using logarithmic values of the four variables.
This will also serve to show how much the approximate economic
conclusions just presented are changed by using the logarithmic
transformations which logically are better suited for measuring these relations.)

Finally, the coefficient of multiple correlation, 0.980, indicates that the
three variables together explained 96 per cent ($R^2$) of all the variance in
annual beef consumption observed during this 20-year period. If this
is interpreted from Figure 17.2, as explained in Chapter 17, there is only
1 chance out of 20 that so high a multiple correlation could have been
obtained from a sample of this size, unless the true correlation in the
universe was at least 0.95. (For other considerations in applying such an
interpretation to time series, see Chapter 20.) Since the variables are
logically consistent, as related to the dependent variable, the results are
thus highly significant as measuring the main factors explaining the
fluctuations in beef consumption.



SECTION IV 


Multiple 
Curvilinear Regressions 


CHAPTER 14 

Determining multiple 
curvilinear regressions by 
algebraic and graphic methods 


The discussion of multiple regression to this point has been limited
to linear relationships, where the change in the dependent variable
accompanying unit changes in each independent variable was assumed
to be of exactly the same amount, no matter how large or how small the
independent variable became. Thus in the farm income example, it
was assumed that each additional cow would be accompanied by the
same increase in income, no matter whether it was the first, the tenth,
or the thirtieth. Similarly, each additional acre in crops or each additional
man employed was assumed to be accompanied by an identical
contribution to the income, no matter how large or how small the business already
was. It is quite evident that such an analysis makes no provision for there
being an optimum size of operation for given circumstances; in this
particular case, it assumes that there is no such thing as the principle of
diminishing returns. Such an analysis might therefore fail entirely to
reveal the proper size of productive unit, or the number of each of the
several elements to be employed to yield maximum returns.

In many other types of problems for which multiple regression analysis
might be used, limitation of the analysis to linear relations would seriously
restrict its value or prevent its use altogether. In dealing with the effect
of weather upon crop yields, several variable weather factors are usually
concerned. There may be an optimum point for growth, with respect




Determining the Regressions by Algebraic and Graphic Methods 205

to both temperature and precipitation, with values either above or below
the optimum tending to produce lower yields. Linear regressions are
obviously unfitted to express such relations. In problems such as these,
and many others which might be enumerated, determination of the
curvilinear relation between independent and dependent variable, while
simultaneously eliminating the effect of other factors which also affect the
dependent variable, is the most important feature in the investigation.

The problem in its simplest outlines may be stated as follows: Given
a series of paired observations of the values of a dependent variable $X_1$
and two or more independent variables $X_2$, $X_3$, $X_4$, etc., required to find
the change in $X_1$ accompanying the changes in $X_2$, $X_3$, and $X_4$, in turn,
while holding the remaining independent factors constant, so that for any
given values of $X_2$, $X_3$, and $X_4$, etc., values may be estimated for $X_1$,
according to the regression equation

$$X_1 = a' + f_2(X_2) + f_3(X_3) + f_4(X_4) + \cdots \tag{14.1}$$

The expression $f_2(X_2)$ is used here simply as a perfectly general term
meaning any regular change in $X_1$ with given changes in $X_2$, whether describable
by a straight line or a curve. The equation is read “$X_1$ is a function of
$X_2$ plus a function of $X_3$,” etc.

The several partial (or “net”) regression curves may be determined
either by the use of definite mathematical expressions, one for each
independent variable, with the constants all determined simultaneously
just as in linear multiple correlation; or by a method known as “successive
graphic approximation,” which involves no prior assumptions as to the
shapes of the curves.

Multiple Regression Curves Mathematically Determined 

In using definite mathematical functions, it is necessary to express the
curvilinear relations by simple mathematical curves of some type, so that
the constants for the curves may be determined by methods similar to
those already presented. If simple parabolas were used, involving only
the first and second powers of each independent variable, equation (14.1)
could be expressed

$$X_1 = a + b_2X_2 + b_{2'}(X_2^2) + b_3X_3 + b_{3'}(X_3^2) + b_4X_4 + b_{4'}(X_4^2) \tag{14.2}$$

In practical research we must distinguish between two situations. If
we know, on the basis of established theory or of previous studies involving
the same variables, the mathematical form of the net relationship between
the dependent and each independent variable, we can proceed at once
to fit the appropriate multiple curvilinear regression function by the


206 Multiple Curvilinear Regressions

method of least squares. The only unknowns in this situation are the
numerical values of the constants of the prescribed function.

If we do not know the forms of the net regression curves, we have a
more complex version of the problem in Chapter 6 in which we determined
from the data a freehand curve expressing the relationship between wheat
protein and the proportion of vitreous kernels. Unless we were willing
to compute a large number of alternative mathematical functions, being
guided as to goodness of fit only by the respective standard errors of
estimate, we would be obliged first to examine our data by graphic means
to determine the approximate forms of the net relationships implicit
in them.

The bulk of this chapter will deal with the second situation. In the
next few pages, however, we shall outline computational methods for
the situation in which the mathematical forms of the net regression curves
are known.

Determining the Curves by Least Squares. The process of determining
net regression curves by the use of a definite mathematical equation
may be illustrated for the following data:

$X_2$   $X_3$   $X_1$        $X_2$   $X_3$   $X_1$

 1       3       8            0       2       7
 2       2      10            4       5       9
 4       7       8            3       3      10
 9       8       9            1       2       9
 5       5      10            6       5      10
 2       3       9            1       2       9
 2       2       9            2       2      10
 7      14       7            4      14       6
 9       8       9            1       2       9
 2       4       8           10      11       8


Let us assume (on the basis of theory or of other studies of these variables)
that the net regressions of $X_1$ on both $X_2$ and $X_3$ may be appropriately
represented by parabolas. Accordingly, the multiple regression equation
is of the form

$$X_1 = a + b_2X_2 + b_{2'}(X_2^2) + b_3X_3 + b_{3'}(X_3^2)$$

The arithmetic required to determine the five constants can be reduced
by “coding” the squared values. Let $U = X_2^2/10$, and $V = X_3^2/10$. The
regression equation then may be written

$$X_1 = a + b_2X_2 + b_uU + b_3X_3 + b_vV$$



The normal equations to determine the constants are next obtained in
exactly the same manner as described in Chapter 11 for a multiple
regression involving four independent variables. The resulting normal equations
are:

$$(\Sigma x_2^2)b_2 + (\Sigma x_2u)b_u + (\Sigma x_2x_3)b_3 + (\Sigma x_2v)b_v = \Sigma x_1x_2$$
$$(\Sigma x_2u)b_2 + (\Sigma u^2)b_u + (\Sigma ux_3)b_3 + (\Sigma uv)b_v = \Sigma x_1u$$
$$(\Sigma x_2x_3)b_2 + (\Sigma ux_3)b_u + (\Sigma x_3^2)b_3 + (\Sigma x_3v)b_v = \Sigma x_1x_3$$
$$(\Sigma x_2v)b_2 + (\Sigma uv)b_u + (\Sigma x_3v)b_3 + (\Sigma v^2)b_v = \Sigma x_1v$$

Carrying out the required computations, the equations are found to be:

$$171.750b_2 + 170.625b_u + 165.000b_3 + 207.600b_v = -2.50$$
$$170.625b_2 + 181.165b_u + 153.540b_3 + 192.316b_v = -5.31$$
$$165.000b_2 + 153.540b_u + 295.200b_3 + 441.480b_v = -50.80$$
$$207.600b_2 + 192.316b_u + 441.480b_3 + 696.012b_v = -86.52$$

The $\Sigma(x_1^2) = 24.20$.

Solving these equations by the usual method, and computing $a$ from
equation (11.12) by restating it

$$a_{1.2u3v} = M_1 - b_2M_2 - b_uM_u - b_3M_3 - b_vM_v$$

we find the regression equation to be

$$X_1 = 9.411 + 1.2709X_2 - 0.7337U - 0.9957X_3 + 0.3309V$$

The net regressions of $X_1$ on $X_2$ and $X_3$ are now shown by the two
parabolic equations:

$$X_1 = 5.596 + 1.2709X_2 - 0.07337X_2^2$$

and

$$X_1 = 12.515 - 0.9957X_3 + 0.03309X_3^2$$

The graph of these two curves is shown in Figure 14.1.

The multiple correlation of $X_1$ with $X_2$, $U$, $X_3$, and $V$ is 0.968. This is
the index of multiple correlation of $X_1$ with $X_2$ and $X_3$, according to the
parabolic regressions. The standard error of estimate, adjusted by
equation (12.2) for the five degrees of freedom used up in determining the
five constants, is found to be 0.319.
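
The whole computation can be sketched with a general least-squares routine in place of the hand solution of the normal equations. This is a swapped-in technique, not the computing form of the book, and the array names are assumptions; the 20 observations are those listed above, and the squared terms are coded as $U = X_2^2/10$ and $V = X_3^2/10$ as in the text.

```python
import numpy as np

# The 20 observations (left-hand block of the data table, then right-hand block).
X2 = np.array([1, 2, 4, 9, 5, 2, 2, 7, 9, 2,
               0, 4, 3, 1, 6, 1, 2, 4, 1, 10], dtype=float)
X3 = np.array([3, 2, 7, 8, 5, 3, 2, 14, 8, 4,
               2, 5, 3, 2, 5, 2, 2, 14, 2, 11], dtype=float)
X1 = np.array([8, 10, 8, 9, 10, 9, 9, 7, 9, 8,
               7, 9, 10, 9, 10, 9, 10, 6, 9, 8], dtype=float)

# Coded squared terms.
U = X2 ** 2 / 10
V = X3 ** 2 / 10

# Least-squares fit of X1 = a + b2*X2 + bu*U + b3*X3 + bv*V.
A = np.column_stack([np.ones_like(X1), X2, U, X3, V])
coef, *_ = np.linalg.lstsq(A, X1, rcond=None)
a, b2, bu, b3, bv = coef

# Index of multiple correlation and adjusted standard error of estimate.
residual = X1 - A @ coef
ss_res = np.sum(residual ** 2)
ss_tot = np.sum((X1 - X1.mean()) ** 2)      # 24.20 in the text
n, m = len(X1), A.shape[1]                   # 20 observations, 5 constants
R = np.sqrt(1 - ss_res / ss_tot)
S_adjusted = np.sqrt(ss_res / (n - m))

print(np.round(coef, 4), round(R, 3), round(S_adjusted, 3))
```

The slope coefficients, the index of multiple correlation, and the adjusted standard error should come out close to the printed values; small differences in the later decimals, particularly in the constant term, are to be expected from the rounding of the hand computation. Dividing `bu` and `bv` by 10 gives the coefficients of the uncoded squares in the two parabolic equations.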




It should be noted that where net curvilinear regressions are found
by this method, the number of constants assumed in the regression
equation is definitely known, and there can be no question as to the exact
correction to apply to the standard error of estimate.


Fig. 14.1. Parabolic regression curves, fitted simultaneously, and net
residuals. (One panel is labeled $X_1 = 5.60 + 1.2709X_2 - 0.07337X_2^2$; the
horizontal axis of the other panel is $X_3$.)


Testing the Fit of the Curves. If there were any doubt as to the
appropriateness of the parabolic relationships in this case, we could check
the goodness of fit of each net regression curve much as we appraised the
fit of the curves in Chapter 6 which involved only one independent
variable. To make this check, after the regression equation is determined for
the particular curves selected, estimated values of $X_1$ are calculated from
the equation. The residual differences between $X_1$ and these estimated
values are then computed. These residuals are then plotted as departures
from the mathematical net regression curves. Carrying this out for the
problem illustrated, we obtain results as follows:

$X_2$  $X_3$  $X_1$  $X_1'$   $z$        $X_2$  $X_3$  $X_1$  $X_1'$   $z$

 1      3      8     7.9     0.1         0      2      7     6.7     0.3
 2      2     10     9.8     0.2         4      5      9     9.2    -0.2
 4      7      8     8.0     0.0         3      3     10     9.9     0.1
 9      8      9     9.1    -0.1         1      2      9     8.8     0.2
 5      5     10     9.8     0.2         6      5     10    10.2    -0.2
 2      3      9     9.0     0.0         1      2      9     8.8     0.2
 2      2      9     9.8    -0.8         2      2     10     9.8     0.2
 7     14      7     7.2    -0.2         4     14      6     5.9     0.1
 9      8      9     9.1    -0.1         1      2      9     8.8     0.2
 2      4      8     8.2    -0.2        10     11      8     7.8     0.2

The residuals obtained above are plotted as departures from the para- 
bolic net regressions, as shown in Figure 14.1. It is evident in this case 
that the parabolic regressions represent the relations quite well, with the 
departures in general evenly distributed on both sides of each curve 
throughout their length. 

Any other type of mathematical function, the parameters of which can
be expressed in the first degree, can be used to determine net regressions
by the method of least squares. Besides the third, fourth, or higher
powers of X_2, such transformations as 10/X_2, 100/X_2^2, log X_2, and
1/log X_2 may be employed as independent variables, either in place of the
previous independent variables or as an addition to the simple statement of
them.

The simple parabola is not very flexible and will often fail to represent 
adequately the relationships existing in sets of data. On the other hand, 
situations in which theory or previous experiments point unequivocally 
to higher order parabolas as expressing the true relationships are relatively 
infrequent. 

If the more flexible cubic parabola is employed, involving the first,
second, and third powers of each independent variable, the equation becomes

    X_1 = a + b_2 X_2 + b_2' X_2^2 + b_2'' X_2^3
            + b_3 X_3 + b_3' X_3^2 + b_3'' X_3^3
            + b_4 X_4 + b_4' X_4^2 + b_4'' X_4^3                  (14.3)

This last equation for three independent variables involves ten constants,
and the clerical labor of dealing with the squared and cubed values is
large (unless they are coded). The curves corresponding to the three
functions in equation (14.3) are:

    f_2(X_2) = b_2 X_2 + b_2' X_2^2 + b_2'' X_2^3

    f_3(X_3) = b_3 X_3 + b_3' X_3^2 + b_3'' X_3^3

    f_4(X_4) = b_4 X_4 + b_4' X_4^2 + b_4'' X_4^3
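Because the cubic parabola is linear in its ten constants, it can be fitted by ordinary least squares once the squared and cubed values have been formed. The following is a minimal Python sketch of that idea, not the computing routine used in this book: the function names `cubic_row`, `solve`, and `least_squares` are hypothetical, and a plain Gaussian elimination on the normal equations stands in for the desk-calculator procedure.

```python
def cubic_row(x2, x3, x4):
    """One observation's regressors: 1, then X, X^2, X^3 for each variable."""
    row = [1.0]
    for x in (x2, x3, x4):
        row.extend([x, x ** 2, x ** 3])
    return row  # ten entries, one per constant


def solve(a, b):
    """Solve the linear system a * c = b by Gaussian elimination."""
    n = len(b)
    m = [list(a[i]) + [b[i]] for i in range(n)]  # augmented matrix
    for col in range(n):
        # partial pivoting: bring the largest remaining entry to the diagonal
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    c = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][k] * c[k] for k in range(i + 1, n))
        c[i] = (m[i][n] - s) / m[i][i]
    return c


def least_squares(rows, y):
    """Form and solve the normal equations (X'X) c = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)
```

Coding the variables to small magnitudes before cubing them, as the text suggests for the clerical work, also keeps the normal equations numerically well behaved in a sketch like this one.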




Whether or not these curves will actually be a good fit to the true functions
can rarely be told beforehand.

Very frequently we find ourselves in the second situation referred to
above, and must determine the net regression curves by examining the
data at hand. A dot chart of the observations on X_1 and X_2 may suggest
some definite mathematical function. However, the true net relationship
may be obscured by the effects on X_1 of other independent variables.
These may be linearly intercorrelated with X_2; we have seen in earlier
chapters how intercorrelation can mask the true net relationships even
in linear regression. If X_1 is related in an unknown curvilinear fashion
to X_3 and other variables, it may be quite impossible to choose an
appropriate mathematical equation for the net regression of X_1 on X_2
simply by looking at the dot chart for these two variables.

In such a situation we need a new method to guide our exploration of
the data for the underlying curvilinear relationships. Such a method,
called the method of successive approximations, is presented below.

Multiple Regression Curves by Successive Approximations 

The general method of determining partial regression curves by the
successive approximation method may be outlined as follows.

The conditions to be imposed on the shape of each curve, in view of the
logical nature of the relations, are first thought through and stated. This
procedure, for each curve, is similar to that described on pages 103 to 108
of Chapter 6.

The linear partial regressions are computed next. Then the dependent
variable is adjusted for the deviations from the means of all independent
variables except one, and a correlation chart, or dot chart, is constructed
between these adjusted values and that independent variable. This
provides the basis for drawing in the first approximation curve for the
net regression of the dependent variable on that independent variable,
within the limitations of the conditions stated. The dependent variable
is then adjusted for all except the next independent variable, the adjusted
values plotted against the values of that variable, and the first
approximation curve determined for that variable. This process is carried
out for each independent variable in turn, yielding a complete set of first
approximations to the net regression curves. These curves are then used
as a basis for adjusting the dependent factor for the approximate
curvilinear effect of all independent variables except one, leaving out each
in turn, and second approximation curves are determined by plotting
these adjusted values against the values of each independent variable
in turn. New adjustments are made from these curves, and the process
is continued until no further change in the several regression curves is
indicated.
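The cycle just outlined (adjust the dependent variable for all curves but one, redraw that one, and repeat) can be sketched in Python. This is only an illustration under stated assumptions: group averages within fixed intervals stand in for the freehand curves, the cycle starts from flat curves rather than from the linear partial regressions, and the names `group_average_curve` and `successive_approximations` are hypothetical.

```python
def group_average_curve(x, y, edges):
    """Return a step 'curve': the mean of y within each interval of x.

    `edges` are the interval boundaries; a value v falls in interval number
    (count of edges <= v).
    """
    def interval(v):
        return sum(v >= e for e in edges)

    totals, counts = {}, {}
    for xi, yi in zip(x, y):
        g = interval(xi)
        totals[g] = totals.get(g, 0.0) + yi
        counts[g] = counts.get(g, 0) + 1
    means = {g: totals[g] / counts[g] for g in totals}
    return lambda v: means.get(interval(v), 0.0)


def successive_approximations(columns, y, edges, rounds=10):
    """Refit each variable's curve in turn to y adjusted for the other curves."""
    n = len(y)
    mean_y = sum(y) / n
    curves = [lambda v: 0.0 for _ in columns]  # start from flat curves
    for _ in range(rounds):
        for j, xj in enumerate(columns):
            adjusted = []
            for i in range(n):
                others = sum(
                    curves[k](columns[k][i])
                    for k in range(len(columns)) if k != j
                )
                # y less its mean and less the effect of every other variable
                adjusted.append(y[i] - mean_y - others)
            curves[j] = group_average_curve(xj, adjusted, edges[j])
    return mean_y, curves
```

An estimate for one observation is then `mean_y` plus the reading of each curve at that observation's values. In practice the number of rounds would be chosen by watching for the point at which the curves stop changing, which is the stopping rule the text describes.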

The process of determining net curvilinear regressions by the successive
graphic approximation method may be illustrated by the data shown in
Table 14.1. These data show, for a period of 38 years, the average
rainfall during June, July, and August, for nine weather stations scattered
through the Corn Belt. This precipitation has been designated as variable
X_3. The average temperature during the same months, at the same stations,
has been designated as X_4. The average yield of corn per acre, in the six
leading Corn Belt states, is shown as X_1, the variable whose fluctuations
are to be explained, so far as possible, by the other factors.

Fig. 14.2. Rainfall, temperature, and corn yields in the Corn Belt,
1890 to 1927.

It is evident from the table that there was an upward trend in corn
yield during this period, although there was not a similar trend in rainfall
or temperature. Plotting each one of the three factors, X_1, X_3, and X_4, as
shown in Figure 14.2, we notice, however, that there have been marked
though irregular long-time cycles in rainfall and temperature during the
period. To a certain extent the swings in yields have agreed with the high
point of the rainfall cycles. It is not safe, therefore, to fit a long-time






Table 14.1

Yield of Corn, Rainfall, and Temperature in Six Leading States, and
Yield Estimated by Linear Regressions on Three Factors*

Year    Time,   Rainfall,   Temperature,   Yield,      Estimated     Difference,
        X_2     X_3         X_4            X_1         yield, X_1'   X_1 - X_1'
                (inches)    (degrees)      (bushels)

1890      0       9.6         74.8          24.5         28.4          -3.9
1891      1      12.9         71.5          33.7         31.6           2.1
1892      2       9.9         74.2          27.9         29.1          -1.2
1893      3       8.7         74.3          27.5         28.5          -1.0
1894      4       6.8         75.8          21.7         27.0          -5.3
1895      5      12.5         74.1          31.9         30.9           1.0
1896      6      13.0         74.1          36.8         31.4           5.4
1897      7      10.1         74.0          29.9         30.0          -0.1
1898      8      10.1         75.0          30.2         29.7           0.5
1899      9      10.1         75.2          32.0         29.8           2.2
1900     10      10.8         75.7          34.0         30.1           3.9
1901     11       7.8         78.4          19.4         27.5          -8.1
1902     12      16.2         72.6          36.0         34.6           1.4
1903     13      14.1         72.0          30.2         33.8          -3.6
1904     14      10.6         71.9          32.4         32.1           0.3
1905     15      10.0         74.0          36.4         31.1           5.3
1906     16      11.5         73.7          36.9         32.2           4.7
1907     17      13.6         73.0          31.5         33.7          -2.2
1908     18      12.1         73.3          30.5         32.9          -2.4
1909     19      12.0         74.6          32.3         32.5          -0.2
1910     20       9.3         73.6          34.9         31.6           3.3
1911     21       7.7         76.2          30.1         29.8           0.3
1912     22      11.0         73.2          36.9         33.0           3.9
1913     23       6.9         77.6          26.8         29.1          -2.3
1914     24       9.5         76.9          30.5         31.0          -0.5
1915     25      16.5         69.9          33.3         37.7          -4.4
1916     26       9.3         75.3          29.7         31.8          -2.1
1917     27       9.4         72.8          35.0         33.0           2.0
1918     28       8.7         76.2          29.9         31.4          -1.5
1919     29       9.5         76.0          35.2         32.1           3.1
1920     30      11.6         72.9          38.3         34.6           3.7
1921     31      12.1         76.9          35.2         33.4           1.8
1922     32       8.0         75.0          35.5         32.1           3.4
1923     33      10.7         74.8          36.7         33.8           2.9
1924     34      13.9         72.6          26.8         36.5          -9.7
1925     35      11.3         75.3          38.0         34.2           3.8
1926     36      11.6         74.1          31.7         35.0          -3.3
1927     37      10.4         71.0          32.6         35.7          -3.1

* Data from E. O. Misner, Studies of the relation of weather to the
production and price of farm products, I. Corn. Mimeographed publication,
Cornell University, March, 1928. The six states are Iowa, Illinois,
Nebraska, Missouri, Indiana, and Ohio.

trend to yield and to assume that in removing that trend we are merely
taking out the effects of such factors as better varieties, improved methods
of tillage, or concentration of acreage in the more fertile sections. Since
there is some association between rainfall and time, at least over
considerable periods, in eliminating all the variation associated with time
we might be eliminating a part of the variation which really reflected
differences in rainfall. Accordingly we may make time itself one of the
factors in the multiple regression and ascribe to time only that part of the
long-time change in yields which is not associated with differences in
rainfall or in temperature. Each year, numbered from 0 up, is therefore
included as one of the factors in the multiple regression¹ and is designated
as variable X_2.

Before starting the statistical process, we must state the conditions
to be observed in fitting a curve to each function. For rainfall, the
considerations are quite similar to those discussed in Chapter 8 for
irrigation water applied, so we shall use the same conditions as stated
there (page 140).

For temperature, the range of possible relations might be wider. There
may be certain temperatures to which the plant does not respond and then
certain higher temperatures which produce a marked response. Again,
if the temperature is too high, a marked reduction in yield might be
produced.² These considerations lead to the following conditions for the
temperature curve:

1. It might rise not at all or slowly in the lower range, then more 
steeply, then taper off until a maximum is reached. 

2. It might decline after the maximum, gradually or sharply, but would 
have only one maximum. 

3. It might have one point of inflection at a temperature below that 
which has the maximum favorable effect on yield. 

With respect to the third curve, that for trend, there is no a priori
reason to expect any given shape during the period concerned, except
that there be no sudden changes from year to year. Accordingly, the
only condition imposed is that the trend have a smooth, gradual change,
with no sharp inflections.

¹ Note the parallel treatment of changes in time as an independent factor in R. A.
Fisher, Statistical Methods for Research Workers, 12th ed., pp. 206-8, Oliver and Boyd,
Edinburgh and London, 1954.

² More elaborate investigations, experimental and statistical, have shown that the
effects of both temperature and rainfall vary at different times of the season, and
especially at certain critical times in the growth of the plant, such as at tasseling.
Also, the particular combination of moisture and heat may be important. These
possibilities will be referred to subsequently, in connection with more refined and
elaborate methods of analysis.




As a preliminary step before starting to determine the net regression
curves, we may examine the apparent relation of yield to rainfall, before
the other factors (temperature and time) are taken into account.

The apparent relation between rainfall (X_3) and yield is indicated
in Figure 14.3, by a dot chart of the relation, with the average yield
indicated for each group of years of similar rainfall. The broken line
connecting these averages indicates that there is a marked curvilinear
relation, a one-inch increase in rainfall being associated with a larger
increase in yield when the basic level of rainfall is low than when the basic
level of rainfall is high. Fitting a straight regression line to these two
variables, the relation is found to be

    X_1 = 23.55 + 0.776 X_3

This line is accordingly drawn in on the chart, cutting across the curve
indicated by the line of group averages.


Fig. 14.3. Apparent relation of corn yields to rainfall (with simple and net
regression lines).

Although Figure 14.3 shows yields to be definitely associated with
differences in rainfall, it must be noted that rainfall is significantly
correlated with X_4, temperature, the correlation being r_34 = -0.67, and is
also slightly correlated with time. To some extent, then, the changes in yield
shown in the figure to be associated with differences in rainfall may really
be due to concomitant differences in the other two factors. The extent
to which these other two factors may have influenced the relations can
be judged by determining the multiple correlation of X_1 with all three
factors, and then noting how the slope of the regression of X_1 on X_3 alone,
which has just been shown plotted in the figure, compares with the
slope of the net regression of X_1 on X_3 determined while simultaneously
holding constant the linear effects of X_2 and X_4. The first step
toward determining the net regression curve, therefore, is to determine
the multiple regression equation and the coefficient of multiple correlation,
according to the methods outlined in Chapters 12 and 13.

The regression equation works out to be

    X_1' = 53.505 + 0.146 X_2 + 0.537 X_3 - 0.405 X_4

and the multiple correlation is 0.55.³

This result shows that when the net linear influence of trend and of
temperature is allowed for, yield increases on the average only 0.54 bushel
for each increase of 1 inch in rainfall, whereas, before these other factors
were taken into account, yield appeared to increase 0.78 bushel with
each additional inch of rainfall. The difference between the simple
regression and the net regression may be shown by plotting the latter
as well in Figure 14.3.⁴ It is then quite apparent how different are the
relations as shown by the two lines.
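As a check on the arithmetic, the simple and the multiple linear regressions can be recomputed from the figures of Table 14.1 (time in years, rainfall in inches, temperature in degrees, yield in bushels). The sketch below is a hedged illustration rather than the worksheet procedure of Chapters 12 and 13: it forms the sums of squares and products about the means and solves the three normal equations by Gaussian elimination, and the names `DATA`, `solve`, and `fit` are hypothetical.

```python
# Table 14.1 as (time X2, rainfall X3, temperature X4, yield X1)
DATA = [
    (0, 9.6, 74.8, 24.5), (1, 12.9, 71.5, 33.7), (2, 9.9, 74.2, 27.9),
    (3, 8.7, 74.3, 27.5), (4, 6.8, 75.8, 21.7), (5, 12.5, 74.1, 31.9),
    (6, 13.0, 74.1, 36.8), (7, 10.1, 74.0, 29.9), (8, 10.1, 75.0, 30.2),
    (9, 10.1, 75.2, 32.0), (10, 10.8, 75.7, 34.0), (11, 7.8, 78.4, 19.4),
    (12, 16.2, 72.6, 36.0), (13, 14.1, 72.0, 30.2), (14, 10.6, 71.9, 32.4),
    (15, 10.0, 74.0, 36.4), (16, 11.5, 73.7, 36.9), (17, 13.6, 73.0, 31.5),
    (18, 12.1, 73.3, 30.5), (19, 12.0, 74.6, 32.3), (20, 9.3, 73.6, 34.9),
    (21, 7.7, 76.2, 30.1), (22, 11.0, 73.2, 36.9), (23, 6.9, 77.6, 26.8),
    (24, 9.5, 76.9, 30.5), (25, 16.5, 69.9, 33.3), (26, 9.3, 75.3, 29.7),
    (27, 9.4, 72.8, 35.0), (28, 8.7, 76.2, 29.9), (29, 9.5, 76.0, 35.2),
    (30, 11.6, 72.9, 38.3), (31, 12.1, 76.9, 35.2), (32, 8.0, 75.0, 35.5),
    (33, 10.7, 74.8, 36.7), (34, 13.9, 72.6, 26.8), (35, 11.3, 75.3, 38.0),
    (36, 11.6, 74.1, 31.7), (37, 10.4, 71.0, 32.6),
]


def solve(a, b):
    """Solve a * c = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    c = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][k] * c[k] for k in range(i + 1, n))
        c[i] = (m[i][n] - s) / m[i][i]
    return c


def fit(data):
    """Return (a, [b2, b3, b4], simple_a, simple_b) for yield on the factors."""
    n = len(data)
    cols = list(zip(*data))
    means = [sum(c) / n for c in cols]
    dev = [[v - m for v in c] for c, m in zip(cols, means)]
    x, y = dev[:3], dev[3]
    # sums of squares and products about the means (the normal equations)
    sxx = [[sum(p * q for p, q in zip(x[i], x[j])) for j in range(3)]
           for i in range(3)]
    sxy = [sum(p * q for p, q in zip(x[i], y)) for i in range(3)]
    b = solve(sxx, sxy)
    a = means[3] - sum(bi * mi for bi, mi in zip(b, means[:3]))
    simple_b = sxy[1] / sxx[1][1]  # yield on rainfall alone
    simple_a = means[3] - simple_b * means[1]
    return a, b, simple_a, simple_b


a, b, simple_a, simple_b = fit(DATA)
print("simple:   X1 = %.2f + %.3f X3" % (simple_a, simple_b))
print("multiple: X1 = %.3f + %.3f X2 + %.3f X3 + %.3f X4" % (a, b[0], b[1], b[2]))
```

Small differences from the printed constants are to be expected, since the published coefficients were rounded at intermediate steps.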

Considering the effect of the other factors reduces the slope of the
linear regression of X_1 on X_3 by nearly one-third. If other factors have
so much effect on the average linear relation, they may have an even
greater effect on the shape of the curve. The net regression line in Figure
14.3 shows the average change in the values of X_1 with different values
of X_3, after the differences in X_2 and X_4 are taken into account. The
average yield for different groups according to rainfall, connected by
the broken line, shows definitely that the simple regression line is but
a poor indication of the underlying relation between X_1 and X_3. The
³ Using units of time in years, rainfall in tenths of inches, temperature in tenths of
degrees, and corn yields in tenths of bushels, we find the normal equations for the data
of Table 14.1 to be:

4,569.50.6,5 5 , q- 248.006,,..,- S.SOft,,.,, = 6,813.00 

248.006,5.5, q- 18,989.066,5.5, - 10,279.4I6„.;5 = 14,726.97 
-8.506,5.5, - 10,279.416,5.5, + I2,40S.866„,5j = -S.442.64 
mf = 70,455.03; s, = 43.0, or 4,3 bushels. 

⁴ The net regression line, showing the change in yield with changes in rainfall while
holding constant time and temperature, may be computed from the multiple regression
equation by substituting the average values for time and for temperature for X_2 and X_4,
and then working out the new constant. For the data given in Table 14.1, the averages
are:

    M_2 = 18.500;  M_3 = 10.784;  M_4 = 74.276;  M_1 = 31.916

If we substitute the means of X_2 and X_4 for their values in the multiple regression
equation, that equation becomes:

    X_1' = 53.505 + (0.146)(18.500) + 0.537 X_3 - (0.405)(74.276) = 26.124 + 0.537 X_3

The net regression line in Figure 14.3 is therefore drawn in from this last equation.
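The substitution of means described in this footnote is simple enough to express as a short Python sketch. The function name `net_line_constant` and the dictionary layout are hypothetical; the coefficients and means are those quoted in the text.

```python
def net_line_constant(a, b, means, keep):
    """Constant of the net regression line for variable `keep`.

    Every other independent variable is replaced by its mean, so its
    term b[k] * means[k] is absorbed into the constant.
    """
    return a + sum(b[k] * means[k] for k in b if k != keep)


A = 53.505                               # constant of the multiple regression
B = {2: 0.146, 3: 0.537, 4: -0.405}      # net regression coefficients
M = {2: 18.500, 3: 10.784, 4: 74.276}    # means of X2, X3, X4

rain_constant = net_line_constant(A, B, M, keep=3)
print(round(rain_constant, 3))  # constant of the line X1 = c + 0.537 X3
```

Calling the same function with `keep=2` gives the constant of the net regression line on time, with rainfall and temperature held at their means.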



net (or partial) regression line may be an equally poor indication of the
relation with the other factors held constant. What is needed is some way
of seeing the differences in the individual values of X_1 for different values
of X_3, after the variation due to X_2 and X_4 has been eliminated. It is
impossible to do this entirely, for we have as yet no measure of the
curvilinear relation of X_1 to X_2 or X_4. But we do have our net regression
coefficients, which measure the linear regression of X_1 on these other
factors, and by using them we can eliminate from X_1 that part of its
variation associated with the linear effects of X_2 and X_4, and then see if
that gives us any clearer picture of the curvilinear relation between X_1
and X_3.
Determining the "First Approximation" Net Regression Curves.
Having determined the linear multiple regression equation, we next
calculate the estimated value of X_1 for each one of the 38 observations,
by substituting the corresponding values of X_2, X_3, and X_4 in the equation.
Each of the estimated values (X_1') is then subtracted from the actual
value (X_1), giving the residual values (z), as also shown in Table 14.1.

The next step is to construct a scatter diagram to show the relation
between variations in X_3 and the variation in X_1 after that associated
with X_2 and X_4 has been eliminated. To do that, the net regression line
for X_1 on X_3 is plotted on Figure 14.4, just as it had been on Figure 14.3.⁵
The residuals for each observation, from Table 14.1, are then plotted
on the chart, with their X_3 value for abscissa and with the value of z as
ordinate from the net regression line as zero base. For the first observation,
X_3 = 9.6 and z = -3.9. The ordinate of the point on the net regression
line corresponding to X_3 = 9.6 is 31.3, and the dot for this observation
is correspondingly plotted 3.9 lower than that, at 27.4. For the second
observation, X_3 = 12.9 and z = +2.1. The ordinate of the point on
the regression line corresponding to X_3 = 12.9 is 33.1, so the dot for
this observation is plotted at 33.1 + 2.1, or 35.2. After the corresponding
operation has been carried out for all the observations, the figure appears
as shown in Figure 14.4.⁶
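The plotting rule for each dot (ordinate of the net regression line, plus the residual z) can be sketched as follows. This is an illustrative Python fragment only: the function names are hypothetical, the line is the net regression of yield on rainfall quoted in the text, and the line ordinate is rounded to one decimal before z is added, as in the worked figures.

```python
def line_ordinate(x3, constant=26.124, slope=0.537):
    """Ordinate of the net regression line of yield on rainfall at x3."""
    return constant + slope * x3


def adjusted_ordinate(x3, z):
    """Height at which the dot is plotted: line ordinate (to 0.1) plus z."""
    return round(line_ordinate(x3), 1) + z


first = adjusted_ordinate(9.6, -3.9)    # first observation, 1890
second = adjusted_ordinate(12.9, 2.1)   # second observation, 1891
print(round(first, 1), round(second, 1))
```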

⁵ To plot the line, all that is necessary is to take the equation of the line to be used
(see previous footnote),

    X_1' = 26.124 + 0.537 X_3

and substitute any two convenient values for X_3, say 6 and 16:

    For X_3 = 6,   X_1' = 26.124 + (0.537)(6)  = 29.35
    For X_3 = 16,  X_1' = 26.124 + (0.537)(16) = 34.71

With these two sets of coordinates, the line is then drawn in with a straight edge
through the points indicated.

⁶ The simplest way of plotting the individual observations is to use a scale, which
can be slid along the regression line as zero. The values of z are then plotted directly as
vertical deviations from the points on the regression line corresponding to the particular
values of the independent variable considered, as X_3 in the present case.



If Figure 14.4 is compared with Figure 14.3, it is readily seen that the
scatter of the dots has been reduced. This will always be true when the
other variables show any significant relation to the dependent factor;
that is, when R_1.234 exceeds r_13. The scatter is reduced because that part
of the variation in X_1 which can be expressed as net linear functions of
X_2 and X_4 has now been eliminated.⁷



Fig. 14.4. Rainfall and yield of corn adjusted to average temperature and year, and
first approximation curve fitted to the averages.

Consideration of Figure 14.4 can be facilitated by computing the
means of the ordinates corresponding to the values of X_3 falling within
convenient intervals. These can be obtained by simply averaging together
the z values for each selected group of values of X_3 and plotting those

⁷ This can readily be proved. Each point on the net regression line was obtained by
the formula:

    (A)  X_1' = a_1.234 + b_12.34 M_2 + b_13.24 X_3 + b_14.23 M_4

To these values have been added the residuals, z. These residuals equal X_1 - X_1', and
therefore for each observation are equal to

    (B)  z = X_1 - a_1.234 - b_12.34 X_2 - b_13.24 X_3 - b_14.23 X_4

The ordinate of each dot in Figure 14.4 is the ordinate of the regression line plus z, and
is therefore equal to the sum of the two equations, (A) and (B). These ordinates are
therefore equal to

    ordinate = a_1.234 + b_12.34 M_2 + b_13.24 X_3 + b_14.23 M_4
               + X_1 - a_1.234 - b_12.34 X_2 - b_13.24 X_3 - b_14.23 X_4

             = X_1 - b_12.34 (X_2 - M_2) - b_14.23 (X_4 - M_4)

The adjusted values shown on Figure 14.4 are therefore simply the values of X_1 less
net linear corrections for deviations in X_2 and X_4 from their mean values.



averages as deviations from the regression line, just as the individual
deviations were plotted previously. The necessary averages are as shown
in Table 14.2.

Table 14.2

Average Values of z for Corresponding X_3 Values

X_3 values       Number of cases   Average of X_3   Average of z

Under 8.0               4               7.30            -3.85
8.0-9.9                10               9.19            +0.16
10.0-10.9               8              10.35            +1.49
11.0-11.9               5              11.40            +2.56
12.0-13.9               8              12.76            -0.52
14.0 and over           3              15.60            -2.20

These averages, when plotted in the same manner as the individual
observations and connected by a broken line, give the irregular line also
shown in Figure 14.4. Comparing this line with the similar one in Figure
14.3, we see that though the lines are in general similar there are some
marked differences. The average for the second group (X_3 = 8.0-9.9)
is now above the straight net regression line, whereas previously it was
below it. Likewise the average for X_3 = 14 and over is now slightly
below the average for X_3 = 12.0-13.9, whereas before it was a little
above it. Also, the difference between the first two averages is not so
large as it appeared before. Apparently part of the previous deviations
reflected other independent factors.
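The group averages used for the line of averages can be reproduced from the rainfall values and residuals of Table 14.1. The Python sketch below is illustrative only; `PAIRS`, `EDGES`, and `group_averages` are hypothetical names, and the interval boundaries are those of the rainfall grouping described above.

```python
# (rainfall X3, residual z) for 1890-1927, from Table 14.1
PAIRS = [
    (9.6, -3.9), (12.9, 2.1), (9.9, -1.2), (8.7, -1.0), (6.8, -5.3),
    (12.5, 1.0), (13.0, 5.4), (10.1, -0.1), (10.1, 0.5), (10.1, 2.2),
    (10.8, 3.9), (7.8, -8.1), (16.2, 1.4), (14.1, -3.6), (10.6, 0.3),
    (10.0, 5.3), (11.5, 4.7), (13.6, -2.2), (12.1, -2.4), (12.0, -0.2),
    (9.3, 3.3), (7.7, 0.3), (11.0, 3.9), (6.9, -2.3), (9.5, -0.5),
    (16.5, -4.4), (9.3, -2.1), (9.4, 2.0), (8.7, -1.5), (9.5, 3.1),
    (11.6, 3.7), (12.1, 1.8), (8.0, 3.4), (10.7, 2.9), (13.9, -9.7),
    (11.3, 3.8), (11.6, -3.3), (10.4, -3.1),
]

# lower bounds of every group after the first ("under 8.0")
EDGES = [8.0, 10.0, 11.0, 12.0, 14.0]


def group_averages(pairs, edges):
    """Return (count, mean x, mean z) for each interval, in ascending order."""
    groups = {}
    for x, z in pairs:
        g = sum(x >= e for e in edges)  # index of the interval holding x
        groups.setdefault(g, []).append((x, z))
    result = []
    for g in sorted(groups):
        xs = [x for x, _ in groups[g]]
        zs = [z for _, z in groups[g]]
        result.append((len(xs), sum(xs) / len(xs), sum(zs) / len(zs)))
    return result


for count, mean_x, mean_z in group_averages(PAIRS, EDGES):
    print(count, round(mean_x, 2), round(mean_z, 2))
```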

It is quite evident that a regression curve is indicated, rising sharply
to a maximum yield between 10 and 12 inches of rain, then declining
gradually for higher rainfalls. Such a curve is accordingly drawn in
freehand, passing as near to the several group averages as is consistent
with a continuous smooth curve, and yet conforming to the limiting
conditions as to its shape. This curve is the first approximation to the
curvilinear function

    f_3(X_3)

which was required to be determined while simultaneously taking into
account the curvilinear effects of X_2 and X_4 on X_1. It is only a first
approximation because it has been determined while allowing for only
the net linear effects of the other two variables. If their curvilinear effects
were determined and allowed for, that might change somewhat the
shape of this curve.



Fig. 14.5. Time and yield of corn adjusted to average temperature and
rainfall, and first approximation curve fitted to the averages.

The next step is to determine similar first approximations to the
curvilinear relation between X_1 and X_2, and between X_1 and X_4, with the
net linear effects of the other variables eliminated just as has been done
for X_3. It is not necessary to plot the apparent relation between X_1 and
X_2 or X_1 and X_4. This was done in the case of X_3 (Figure 14.3) solely
to illustrate the difference between taking the apparent relations and taking
the net relations after the linear influence of the other factors had been
allowed for (Figure 14.4). Instead, we may proceed at once to examine
the net relation of X_1 to X_2. Figure 14.5 shows this step. This figure is
constructed exactly as was Figure 14.4, by the following steps: (1) Plot
the net regression line.⁸ (2) Plot in the individual residuals, z, as deviations
from that line.⁹ (3) Average the residuals grouped according to X_2,

⁸ The regression equation, for mean values of X_3 and X_4, becomes

    X_1' = 53.505 + 0.146 X_2 + 0.537(M_3) - 0.405(M_4)
         = 53.505 + 0.146 X_2 + (0.537)(10.784) - (0.405)(74.276)
         = 29.214 + 0.146 X_2

This equation is then the equation to which the net regression line in Figure 14.5 is
drawn. Substituting the values X_2 = 0 and X_2 = 20 in the equation, values for X_1' of
29.214 and 32.13 are obtained, giving the coordinate points for drawing in the line.

⁹ For the first observation, X_2 = 0 and z = -3.9. The point on the regression line
corresponding to X_2 = 0 has an ordinate of 29.2. The dot for this observation is
accordingly plotted at 29.2 - 3.9, or 25.3. For the next observation, X_2 = 1 and
z = 2.1. The corresponding ordinate on the regression line is 29.4, so the dot is plotted
at 29.4 + 2.1, or 31.5. The dot for each observation is plotted in turn in the same
way, with a sliding graphic scale to place the dots above or below the regression line.




plot the group averages, and connect them by a broken line. (4) Draw
in a smooth curve through the line of averages, if a curve is indicated,
conforming to the limiting conditions stated for this curve.

After the first two steps have been carried out, just as described for
Figure 14.4, grouping and averaging the residuals with respect to X_2
give the averages shown in Table 14.3.

Table 14.3

Average Values of z for Corresponding X_2 Values

X_2 values   Number of cases   Average of X_2   Average of z

 0- 7               8               3.5             -0.38
 8-15               8              11.5             +0.24
16-23               8              19.5             +0.64
24-31               8              27.5             +0.26
32-37               6              34.5             -1.00


The average residuals shown in the table are then plotted above and
below the regression line in Figure 14.5 and connected by a broken line.
This line of averages indicates that corn yield (for years of similar rainfall
and temperature) rose rapidly during the earlier years, then more and
more gradually, until during the last ten years it tended to remain about
on the same level. A smooth continuous curve is therefore drawn through
the averages, completing step (4) and giving the first approximation to the
curvilinear net regression of X_1 on X_2, f_2(X_2).

The same operations are then carried out for X_4, as shown in Figure
14.6. After drawing in the net regression line,¹⁰ and plotting in the
individual observations,¹¹ we group the residuals on X_4 and average,
with the results shown in Table 14.4.

¹⁰ The net regression line for X_1 and X_4 may be determined by an alternative method
to that used before. On such charts as Figures 14.4, 14.5, or 14.6, the net regression
line will always pass through the mean of the two variables. For Figure 14.6, therefore,
X_1 will have its mean value, 31.92, when X_4 has its mean value, 74.28. From the net
regression coefficient, b_14.23, it is evident that each unit increase in X_4 is accompanied
by -0.405 unit increase in X_1. If X_4 is increased from 74.28 to 78.28, or 4 units, X_1 will
change by (-0.405)(4) or -1.62. For X_4 = 78.28, X_1 will therefore be 31.92 - 1.62,
or 30.30. This gives the two sets of points necessary to locate the line: when X_4 = 74.28,
X_1 = 31.92; and when X_4 = 78.28, X_1 = 30.30.

¹¹ The individual residuals are plotted in the same way as indicated in the other two
cases; the residual -3.9 for X_4 = 74.8 is plotted 3.9 units below the corresponding
point on the regression line, and similarly for the other observations.




Table 14.4

Average Values of z for Corresponding X_4 Values

X_4 values       Number of cases   Average of X_4   Average of z

Under 72.0              4              71.08            -1.28
72.0-72.9               5              72.58            -1.24
73.0-73.9               5              73.36            +1.46
74.0-74.9              10              74.30            +0.49
75.0-75.9               7              75.33            +0.91
76.0-76.9               5              76.44            +0.64
77.0 and over           2              78.00            -5.20

76.0 and over           7              76.89            -1.03

The last group, on the first grouping, has but two cases, so the last two
groups are combined, giving the averages shown in the last line. The
fact that both the items above 77 degrees are low, also evident in Figure
14.6, would give a little more reliability to the average based on only
two items; but it is generally unsafe to give such an extreme bend to the
end of a regression curve as this would call for, on the basis of so few
observations. The larger grouping will therefore be used in this case,
leaving the subsequent approximations to determine whether the more
extreme bend is justified.

Fig. 14.6. Temperature and yield of corn adjusted to average rainfall and year, and
first approximation curve fitted to the averages.

The line of averages in Figure 14.6 indicates that yields may tend to
rise as temperature increases up to between 73 and 75 degrees, and then
to fall as the temperature goes still higher. A smooth curve is therefore
drawn in, averaging out the irregularities shown in the broken line of
the group averages and conforming to the limiting conditions stated on
page 213. It does not make much difference if these first approximation
curves are not drawn in in exactly the right position or shape, as the
subsequent operations will tend to correct them to the proper shape if
the original one is incorrect. It is for that reason that fairly accurate
results can be secured by this graphic process, even though the true shape
of the curves is not known at the beginning.

Estimating X_1 from the First Approximation Curves. We have
now arrived at first approximations to the net regression curves for X_1
against each of the three factors. It must be remembered that in making
the adjustments on X_1 to arrive at these curves, only the net linear effects
of the other independent variables have been eliminated. Now that we
have at least an approximate measure of the curvilinear relations of X_1
to the independent variables, making adjustments to eliminate these
approximate curvilinear effects may enable us to determine more accurately
the true curvilinear relation to each variable.

The first step in the next stage of the process is to work out estimated
values of X_1 based on the curvilinear relations. To do this we may
designate the relation between X_1 and X_2 shown by the curve in Figure
14.5 as f_2'(X_2), the relation between X_1 and X_3 shown in Figure 14.4 as
f_3'(X_3), and the relation between X_1 and X_4 shown in Figure 14.6 as
f_4'(X_4). The estimates of X_1 may then be worked out by the regression
equation

    X_1'' = a'_1.234 + f_2'(X_2) + f_3'(X_3) + f_4'(X_4)          (14.4)

The symbol X_1'' is used to designate this second set of estimates, just as
X_1' was used to designate the first set, worked out from the linear regression
equation. The constant a'_1.234 is different from the constant a_1.234 used
in equation (11.7); its value is given by the formula

    a'_1.234 = M_1 - Σ[f_2'(X_2) + f_3'(X_3) + f_4'(X_4)] / n     (14.5)

To work out a'_1.234 according to equation (14.5), it is first necessary to
work out the value f_2'(X_2) + f_3'(X_3) + f_4'(X_4) for each observation. For
the first observation, for example, X_2 = 0, X_3 = 9.6, and X_4 = 74.8.


From f_2'(X_2), given in Figure 14.5, the curve reading (or ordinate)
corresponding to a value of 0 for X_2 is 27.3. For f_3'(X_3), Figure 14.4, the
ordinate of the curve corresponding to X_3 = 9.6 is 31.7. For f_4'(X_4),
Figure 14.6, the curve ordinate corresponding to X_4 = 74.8 is 32.5.
The value [f_2'(X_2) + f_3'(X_3) + f_4'(X_4)] for the first observation is
therefore [27.3 + 31.7 + 32.5], or 91.5. The sum of these values for each
observation is the value required in equation (14.5).
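The arithmetic for the new constant can be sketched in Python. The sketch assumes that equation (14.5) sets the constant equal to the mean of X_1 less the average, over all observations, of the summed curve readings, so that the estimates average out to the mean yield; the names `curve_sum` and `curvilinear_constant` are hypothetical, and only the three readings quoted for the first observation are entered here.

```python
# Curve readings quoted in the text for the first observation (1890)
F2 = {0: 27.3}       # trend curve, Figure 14.5, at X2 = 0
F3 = {9.6: 31.7}     # rainfall curve, Figure 14.4, at X3 = 9.6
F4 = {74.8: 32.5}    # temperature curve, Figure 14.6, at X4 = 74.8


def curve_sum(x2, x3, x4):
    """Sum of the three curve readings for one observation."""
    return F2[x2] + F3[x3] + F4[x4]


def curvilinear_constant(mean_x1, sums):
    """Assumed form of equation (14.5): mean of X1 less the mean curve sum."""
    return mean_x1 - sum(sums) / len(sums)


first_sum = curve_sum(0, 9.6, 74.8)
print(round(first_sum, 1))
```

With readings recorded for every observed value of each variable, `curve_sum` would be evaluated for all 38 observations and the list passed to `curvilinear_constant` together with the mean yield.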

Before continuing the process of reading each value from the charts
for the remaining observations, it should be noted that, since many
observations of each variable have the same values, the same point would
be read from each chart many times. The process of working out the
computations can be much simplified by reading each required value
from each chart once and for all, and recording it so that it can be used
each time. Since each chart indicates each individual observation for
each independent variable, only those points for which there are
observations need be recorded. Carrying out this process, we may record the
functional relations as shown in Tables 14.5, 14.6, and 14.7, which show
the readings from Figures 14.5, 14.4, and 14.6, respectively.¹²

Table 14.5

Values of X_1 Corresponding to Given Values of X_2, from the First
Approximation Curve

X_2   f_2'(X_2)     X_2   f_2'(X_2)     X_2   f_2'(X_2)     X_2   f_2'(X_2)

 0      27.3         10     30.8         20     32.8         29     33.4
 1      27.8         11     31.0         21     33.0         30     33.5
 2      28.2         12     31.3         22     33.1         31     33.5
 3      28.6         13     31.5         23     33.1         32     33.5
 4      29.0         14     31.7         24     33.2         33     33.5
 5      29.4         15     31.9         25     33.2         34     33.5
 6      29.7         16     32.1         26     33.3         35     33.5
 7      30.0         17     32.3         27     33.3         36     33.5
 8      30.3         18     32.5         28     33.4         37     33.5
 9      30.6         19     32.6






¹² In entering these values it is not worth while reading further than the first decimal, for the line is not drawn more accurately than to within 0.1 or 0.2. The accuracy depends, of course, on the scale; but it is not worth using very large charts to secure spuriously high accuracy, when the standard error of any particular point on the curve is probably several units and when the curve is only a first approximation, subject to subsequent modification.






Table 14.6

Values of $X_1$ Corresponding to Given Values of $X_3$, from the First Approximation Curve

 $X_3$   $f_3'(X_3)$     $X_3$   $f_3'(X_3)$     $X_3$   $f_3'(X_3)$     $X_3$   $f_3'(X_3)$
  6.8       24.6           9.5      31.5          10.8      33.4          12.9      33.3
  6.9       25.0           9.6      31.7          11.0      33.5          13.0      33.2
  7.7       27.1           9.9      32.4          11.3      33.6          13.6      32.9
  7.8       27.4          10.0      32.5          11.5      33.7          13.9      32.7
  8.0       27.9          10.1      32.6          11.6      33.7          14.1      32.5
  8.7       29.7          10.4      33.1          12.0      33.7          16.2      31.0
  9.3       31.0          10.6      33.3          12.1      33.6          16.5      30.8
  9.4       31.2          10.7      33.4          12.5      33.5

Table 14.7

Values of $X_1$ Corresponding to Given Values of $X_4$, from the First Approximation Curve

 $X_4$   $f_4'(X_4)$     $X_4$   $f_4'(X_4)$     $X_4$   $f_4'(X_4)$     $X_4$   $f_4'(X_4)$
 69.9       30.2          73.0      32.5          74.2      32.8          75.7      31.6
 71.0       31.0          73.2      32.6          74.3      32.7          75.8      31.5
 71.5       31.4          73.3      32.6          74.6      32.6          76.0      31.3
 71.9       31.7          73.6      32.7          74.8      32.5          76.2      31.0
 72.0       31.8          73.7      32.7          75.0      32.3          76.9      30.1
 72.6       32.2          74.0      32.8          75.2      32.1          77.6      29.0
 72.8       32.3          74.1      32.8          75.3      32.0          78.4      27.6
 72.9       32.4








The values to determine $a'_{1.234}$ may now be worked out in orderly manner, as shown in Table 14.8, in the fourth to the seventh columns. This computation gives us the sum of the respective functional values for the 38 observations. Substituting this sum and the number of observations in equation (14.5), we find the required constant to be

$$a'_{1.234} = 31.916 - \frac{3{,}621.9}{38} = -63.397$$

Since the functional values for our regression equation are expressed only to one decimal point, we shall use -63.4 for $a'_{1.234}$, which will result in the estimated values being 0.003 unit too low, on the average.
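The arithmetic of equation (14.5) can be checked in a few lines. This is only a sketch; the function name is an assumption made for the illustration, while the mean of $X_1$ (31.916), the total of the functional values (3,621.9), and the number of observations (38) are the figures given in the text and in Table 14.8.

```python
def regression_constant(mean_x1, sum_of_functional_values, n):
    """Equation (14.5): mean of X1 minus the mean of the summed curve readings."""
    return mean_x1 - sum_of_functional_values / n


a_first = regression_constant(31.916, 3621.9, 38)
a_used = round(a_first, 1)          # the curves are read to one decimal only

print(round(a_first, 3))            # -63.397
print(a_used)                       # -63.4
print(round(a_used - a_first, 3))   # -0.003, so estimates run 0.003 too low
```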







Table 14.8

Computation of Functional Values Corresponding to Independent Variables, of the Estimated Value of $X_1$, and the New Residual, for Each Observation

Columns: (1) $X_2$; (2) $X_3$; (3) $X_4$; (4) $f_2'(X_2)$; (5) $f_3'(X_3)$; (6) $f_4'(X_4)$; (7) sum of columns (4) to (6); (8) $X_1''$ = column (7) - 63.4; (9) $X_1$; (10) $z'' = X_1 - X_1''$.

   (1)     (2)     (3)       (4)       (5)       (6)        (7)      (8)     (9)     (10)
     0     9.6    74.8      27.3      31.7      32.5       91.5     28.1    24.5    -3.6
     1    12.9    71.5      27.8      33.3      31.4       92.5     29.1    33.7     4.6
     2     9.9    74.2      28.2      32.4      32.8       93.4     30.0    27.9    -2.1
     3     8.7    74.3      28.6      29.7      32.7       91.0     27.6    27.5    -0.1
     4     6.8    75.8      29.0      24.6      31.5       85.1     21.7    21.7     0
     5    12.5    74.1      29.4      33.5      32.8       95.7     32.3    31.9    -0.4
     6    13.0    74.1      29.7      33.2      32.8       95.7     32.3    36.8     4.5
     7    10.1    74.0      30.0      32.6      32.8       95.4     32.0    29.9    -2.1
     8    10.1    75.0      30.3      32.6      32.3       95.2     31.8    30.2    -1.6
     9    10.1    75.2      30.6      32.6      32.1       95.3     31.9    32.0     0.1
    10    10.8    75.7      30.8      33.4      31.6       95.8     32.4    34.0     1.6
    11     7.8    78.4      31.0      27.4      27.6       86.0     22.6    19.4    -3.2
    12    16.2    72.6      31.3      31.0      32.2       94.5     31.1    36.0     4.9
    13    14.1    72.0      31.5      32.5      31.8       95.8     32.4    30.2    -2.2
    14    10.6    71.9      31.7      33.3      31.7       96.7     33.3    32.4    -0.9
    15    10.0    74.0      31.9      32.5      32.8       97.2     33.8    36.4     2.6
    16    11.5    73.7      32.1      33.7      32.7       98.5     35.1    36.9     1.8
    17    13.6    73.0      32.3      32.9      32.5       97.7     34.3    31.5    -2.8
    18    12.1    73.3      32.5      33.6      32.6       98.7     35.3    30.5    -4.8
    19    12.0    74.6      32.6      33.7      32.6       98.9     35.5    32.3    -3.2
    20     9.3    73.6      32.8      31.0      32.7       96.5     33.1    34.9     1.8
    21     7.7    76.2      33.0      27.1      31.0       91.1     27.7    30.1     2.4
    22    11.0    73.2      33.1      33.5      32.6       99.2     35.8    36.9     1.1
    23     6.9    77.6      33.1      25.0      29.0       87.1     23.7    26.8     3.1
    24     9.5    76.9      33.2      31.5      30.1       94.8     31.4    30.5    -0.9
    25    16.5    69.9      33.2      30.8      30.2       94.2     30.8    33.3     2.5
    26     9.3    75.3      33.3      31.0      32.0       96.3     32.9    29.7    -3.2
    27     9.4    72.8      33.3      31.2      32.3       96.8     33.4    35.0     1.6
    28     8.7    76.2      33.4      29.7      31.0       94.1     30.7    29.9    -0.8
    29     9.5    76.0      33.4      31.5      31.3       96.2     32.8    35.2     2.4
    30    11.6    72.9      33.5      33.7      32.4       99.6     36.2    38.3     2.1
    31    12.1    76.9      33.5      33.6      30.1       97.2     33.8    35.2     1.4
    32     8.0    75.0      33.5      27.9      32.3       93.7     30.3    35.5     5.2
    33    10.7    74.8      33.5      33.4      32.5       99.4     36.0    36.7     0.7
    34    13.9    72.6      33.5      32.7      32.2       98.4     35.0    26.8    -8.2
    35    11.3    75.3      33.5      33.6      32.0       99.1     35.7    38.0     2.3
    36    11.6    74.1      33.5      33.7      32.8      100.0     36.6    31.7    -4.9
    37    10.4    71.0      33.5      33.1      31.0       97.6     34.2    32.6    -1.6
Totals                   1,208.4   1,204.2   1,209.3    3,621.9





It is now possible to complete the process of computing $X_1''$, the estimated value of $X_1$, using the first approximation curves, according to equation (14.4), and the constant which has just been computed. When equations (14.4) and (14.5) are compared, it is evident that, except for the constant term, $X_1''$ is equal to the values that have just been computed in the seventh column of Table 14.8. Accordingly, all that is necessary is to subtract 63.4 from each of those values. This step is shown also in Table 14.8, in the eighth column.

The column headed $X_1''$ shows the estimated values obtained by this process. The next step is to see whether the new estimates come any nearer to reproducing the observed values of $X_1$ than did the first set of estimates, based on the linear regression equation. We therefore compute a new set of residuals, $z''$, by subtracting the new estimates from the actual values of $X_1$. This step, also, is shown in Table 14.8.

$$z'' = X_1 - X_1'' \tag{14.6}$$

If the individual residuals shown are compared with the residuals obtained by the linear regression, as computed in Table 14.1, it will be seen that in general the new residuals are smaller than the previous ones, though the reverse is true in many cases. There are 23 cases in which the new residual is smaller, and 15 in which it is larger than the original residual. A more accurate comparison can be obtained by comparing the adjusted standard deviations of the residuals for the two sets. If we assume that the curves involving $X_2$ and $X_3$ each use up three degrees of freedom while the relationship to $X_4$ requires four, the multiple curvilinear function as a whole involves eleven degrees of freedom (counting one for the constant), compared with four degrees for the multiple linear regression. For the linear correlation, the adjusted standard deviation of the residuals was 3.8 bushels, whereas the adjusted standard deviation of the new residuals is 3.5 bushels. Apparently the new estimates do come nearer to the observed values, on the average, than did the first set of estimates.
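The comparison just described can be sketched numerically. The residuals below are the $z''$ values of Table 14.8. The adjustment formula used here, $\sqrt{\sum z^2 / (n - m)}$ with $m$ the number of degrees of freedom used up, is an assumption made for this sketch; the adjustment is defined elsewhere in the book, and the function names are illustrative only.

```python
import math

# Residuals z'' from column (10) of Table 14.8, in order of observation.
z_first = [
    -3.6, 4.6, -2.1, -0.1, 0.0, -0.4, 4.5, -2.1,
    -1.6, 0.1, 1.6, -3.2, 4.9, -2.2, -0.9, 2.6,
    1.8, -2.8, -4.8, -3.2, 1.8, 2.4, 1.1, 3.1,
    -0.9, 2.5, -3.2, 1.6, -0.8, 2.4, 2.1, 1.4,
    5.2, 0.7, -8.2, 2.3, -4.9, -1.6,
]


def estimate_and_residual(functional_sum, constant, actual):
    """Estimate of X1 (sum of readings plus constant) and residual, equation (14.6)."""
    estimate = round(functional_sum + constant, 1)
    return estimate, round(actual - estimate, 1)


def adjusted_sd(residuals, degrees_used):
    """Assumed adjustment: root of the sum of squares over (n - degrees used)."""
    n = len(residuals)
    return math.sqrt(sum(z * z for z in residuals) / (n - degrees_used))


print(estimate_and_residual(91.5, -63.4, 24.5))   # first observation: (28.1, -3.6)
print(round(adjusted_sd(z_first, 11), 1))         # 3.5 bushels
```

Under this assumed formula the 38 residuals reproduce the 3.5 bushels quoted in the text when eleven degrees of freedom are charged to the curves.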

Determining the Second Approximation Net Regression Curves. The regression curves used in constructing the estimate $X_1''$ were only the first approximations to the true curvilinear relations, since they were determined by eliminating only the linear effects of the other independent factors. Now that the residuals obtained by the use of the first approximation curves have been computed, however, we can determine whether any change in the shape of the several curves is necessary.

To do this we construct Figure 14.7 by drawing in the regression curve from Figure 14.5, using the same scale as before. Use of Table 14.5 makes it easier to reproduce the curve. Next we plot each of the last residuals as a deviation just as before, except that now the residuals are plotted as deviations from the regression curve, instead of from the regression line, at the point corresponding to the independent variable $X_2$. Thus the first observation, with $X_2 = 0$, has $z'' = -3.6$. The point on the curve corresponding to $X_2 = 0$ is 27.3; so the dot has for ordinate 27.3 - 3.6, or 23.7. The values for the next observation are $X_2 = 1$ and $z'' = 4.6$. The corresponding value of $f_2'(X_2)$ is 27.8, so the ordinate for the dot is 27.8 + 4.6, or 32.4. The coordinates for this dot are therefore 1 and 32.4. The remaining observations are plotted in the same manner, shortening the process by scaling the value for $z''$ directly above or below from the corresponding point on the regression curve.

Fig. 14.7. Time, and yield of corn adjusted to average temperature and rainfall on basis of first approximation curves, and second approximation to $f_2(X_2)$.

With the dots all plotted, it is evident that the scatter is too great to indicate definitely, from the dots alone, what changes, if any, may be needed in the curve. Accordingly the residuals are averaged in groups, employing the same grouping as before (Table 14.3), which eliminates the need of averaging the corresponding $X_2$ values over again. The new averages work out as shown in Table 14.9.


Table 14.9

Average Values of $z''$ for Corresponding $X_2$ Values

$X_2$ Values    Number of Cases    Average of $X_2$    Average of $z''$
    0- 7               8                  3.5               +0.10
    8-15               8                 11.5               +0.16
   16-23               8                 19.5               -0.08
   24-31               8                 27.5               +0.64
   32-37               6                 34.5               -1.08
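The group averages of Table 14.9 can be reproduced from the residuals of Table 14.8. The sketch below relies on the fact that the observations are listed in order of $X_2$ (0 to 37), so the position of a residual in the list is its $X_2$ value; the function and variable names are assumptions made for the illustration.

```python
# Residuals z'' from Table 14.8; the list index equals X2 (0 through 37).
z_first = [
    -3.6, 4.6, -2.1, -0.1, 0.0, -0.4, 4.5, -2.1,
    -1.6, 0.1, 1.6, -3.2, 4.9, -2.2, -0.9, 2.6,
    1.8, -2.8, -4.8, -3.2, 1.8, 2.4, 1.1, 3.1,
    -0.9, 2.5, -3.2, 1.6, -0.8, 2.4, 2.1, 1.4,
    5.2, 0.7, -8.2, 2.3, -4.9, -1.6,
]

# Groups of X2 values, as in Table 14.9 (both ends inclusive).
groups = [(0, 7), (8, 15), (16, 23), (24, 31), (32, 37)]


def group_average(residuals, first, last):
    """Average residual for observations with first <= X2 <= last."""
    members = residuals[first:last + 1]
    return sum(members) / len(members)


averages = [group_average(z_first, lo, hi) for lo, hi in groups]
for (lo, hi), avg in zip(groups, averages):
    # The averages come out near +0.10, +0.16, -0.075, +0.64, and -1.08.
    print(f"{lo:2d}-{hi:2d}  {avg:+.4f}")
```

Each average is then plotted as a deviation from the first approximation curve at the group's average $X_2$; for example, the last group would be plotted 1.08 below the curve reading at $X_2 = 34.5$.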




The averages are next plotted as deviations from the first approximation curve. They indicate that a slight raise in the lower part of the curve may be needed, and a downward bend toward the end. It appears that, now that the influence of rainfall and temperature on yield has been more accurately allowed for, the upward trend with time is slightly less in the early years than it seemed before, and the trend seems to have turned downward toward the end of the series; the exact year or extent of the turn is indeterminate. A new curve is therefore drawn in Figure 14.7, and, as it happens, a smooth, continuous curve can be drawn exactly through each of the first three group averages, but not having the extreme bend indicated by the last two group averages.¹³



Fig. 14.8. Rainfall, and yield of corn adjusted to average temperature and time on the basis of first approximation curves, and second approximation to $f_3(X_3)$.

The same process may now be applied to $X_3$, to see if any change need be made in the first regression curve for the change in $X_1$ with changes in that variable. This process is carried out as shown in Figure 14.8, the first approximation curve being drawn in just as before, using the data given in Table 14.6.

Instead of plotting the individual residuals for each observation, as was just done with respect to $X_2$, we may proceed at once to compute the average residuals for each of the groups of values of $X_3$, since it is sufficiently apparent from Figure 14.7 that the scatter of the individual observations is still too great to serve as a guide in correcting the first approximation curves. Averaging the residuals gives the averages shown in Table 14.10.

¹³ As Figure 14.7 shows, it is the coincidence of 3 low years out of the last 4 in the series that provides the basis for this downturn. This is really too small a sample to provide a firm basis for changing a curve.




Table 14.10

Average Values of $z''$ for Corresponding $X_3$ Values

                                                                       Combined groups
$X_3$ Values     Number of Cases   Average of $X_3$   Average of $z''$   Average of $X_3$   Average of $z''$
Under 8.0               4                7.30              +0.58
 8.0- 9.9              10                9.19              +0.03
10.0-10.9               8               10.35              -0.15   }
11.0-11.9               5               11.40              +0.48   }         10.75              +0.09
12.0-13.9               8               12.76              -1.11   }
14.0 and over           3               15.60              +1.73   }         13.53              -0.34


Again the averages are somewhat irregular when plotted, so the last four groups are reduced to two, and the new averages plotted and indicated separately. The number of observations represented by each of the first set of averages is indicated next to it, so that averages based on a small number of observations will not be given undue weight in drawing in the curve. It might be desirable in some cases, also, to try regrouping the cases into different groups, say from 8.5-9.4, 9.5-10.4, etc., and see if that would change at all the indications as to the shifts needed in the first curve. Working that out in this case, the changes needed are still found to be about the same as shown by the group averages in Figure 14.8, though somewhat less regular, owing to the smaller size of groups. A new curve is then drawn in freehand, as indicated by the group averages, rising somewhat higher than formerly at both ends, and not rising quite so high in the central portion as before.

Fig. 14.9. Temperature, and yield of corn adjusted to average rainfall and time on the basis of first approximation curves, and second approximation to $f_4(X_4)$. (Horizontal axis: temperature in degrees, $X_4$, from 69 to 78.)
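Reducing two adjacent groups to one is simply a weighted average, each group average being weighted by its number of cases. A small sketch follows, using the 10.0-10.9 and 11.0-11.9 groups of Table 14.10; the function name is an assumption made for the illustration.

```python
def combine(groups):
    """Pool group averages; `groups` is a list of (number_of_cases, average) pairs."""
    total_cases = sum(n for n, _ in groups)
    return sum(n * avg for n, avg in groups) / total_cases


# The 10.0-10.9 group (8 cases) and the 11.0-11.9 group (5 cases) of Table 14.10.
pooled_x3 = combine([(8, 10.35), (5, 11.40)])
pooled_z = combine([(8, -0.15), (5, 0.48)])

print(round(pooled_x3, 2))   # 10.75
print(round(pooled_z, 2))    # 0.09
```

The weighting is what keeps an average based on only a few observations from pulling the curve unduly.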

Turning to the relation between $X_1$ and $X_4$, the first approximation curve for $f_4'(X_4)$ is reproduced in Figure 14.9, using the values given in Table 14.7. The next step is to average the values of $z''$ for corresponding values of $X_4$. Using the same groupings used in Table 14.4, we arrive at the averages shown in Table 14.11.

Table 14.11

Average Values of $z''$ for Corresponding $X_4$ Values

$X_4$ Values      Number of Cases    Average of $X_4$    Average of $z''$
Under 72.0               4                71.08               +1.15
72.0-72.9                5                72.58               -0.36
73.0-73.9                5                73.36               -0.58
74.0-74.9               10                74.30               -0.86
75.0-75.9                7                75.33               +0.63
76.0-76.9                5                76.44               +0.90
77.0 and over            2                78.00               -0.05

76.0 and over            7                76.89               +0.63

Plotting these new averages, and connecting them by a broken line, we see that the relation of yield to temperature may be quite different from the way it appeared on the first approximation. Apparently the highest yields are obtained around 75 to 76 degrees, instead of at 74 degrees; higher temperatures appear to reduce the yield markedly, but lower temperatures have only a slight influence on the yield. These indications are all within the theoretical limitations on the shape of the curve, as stated on page 213. The new curve, drawn in freehand so as to pass as nearly through these new averages as possible and still maintain a smooth continuous shape, with only a single maximum, expresses these relations.

Table 14.12

Values of $X_1$ Corresponding to Given Values of $X_2$, from the Second Approximation Curve

 $X_2$   $f_2''(X_2)$    $X_2$   $f_2''(X_2)$    $X_2$   $f_2''(X_2)$    $X_2$   $f_2''(X_2)$
   0        27.4           10       31.0           20       32.7           29       33.6
   1        27.9           11       31.2           21       33.0           30       33.5
   2        28.4           12       31.4           22       33.2           31       33.4
   3        28.8           13       31.6           23       33.3           32       33.2
   4        29.2           14       31.8           24       33.4           33       33.0
   5        29.5           15       32.0           25       33.5           34       32.8
   6        29.8           16       32.1           26       33.6           35       32.6
   7        30.2           17       32.3           27       33.7           36       32.4
   8        30.4           18       32.5           28       33.7           37       32.2
   9        30.7           19       32.6

Estimating $X_1$ from the Second Approximation Curves. Now that the second approximation curves have been determined for each variable, we can proceed to estimate values of $X_1$ on the basis of the revised curves, to see whether the new curves enable us to estimate $X_1$ any more accurately than the first set of curves did. To facilitate the process we first construct Tables 14.12, 14.13, and 14.14 for $f_2''(X_2)$, $f_3''(X_3)$, and $f_4''(X_4)$, showing the readings for the functions from the revised curves.


Table 14.13

Values of $X_1$ Corresponding to Given Values of $X_3$, from the Second Approximation Curve

 $X_3$   $f_3''(X_3)$    $X_3$   $f_3''(X_3)$    $X_3$   $f_3''(X_3)$    $X_3$   $f_3''(X_3)$
  6.8       25.5           9.5      31.5          10.8      33.3          12.9      33.0
  6.9       25.7           9.6      31.7          11.0      33.4          13.0      33.0
  7.7       27.5           9.9      32.2          11.3      33.4          13.6      32.8
  7.8       27.8          10.0      32.3          11.5      33.3          13.9      32.7
  8.0       28.2          10.1      32.5          11.6      33.3          14.1      32.7
  8.7       29.9          10.4      32.9          12.0      33.2          16.2      32.2
  9.3       31.1          10.6      33.1          12.1      33.2          16.5      32.1
  9.4       31.3          10.7      33.2          12.5      33.1




Table 14.14

Values of $X_1$ Corresponding to Given Values of $X_4$, from the Second Approximation Curve

 $X_4$   $f_4''(X_4)$    $X_4$   $f_4''(X_4)$    $X_4$   $f_4''(X_4)$    $X_4$   $f_4''(X_4)$
 69.9       31.6          73.0      32.0          74.2      32.2          75.7      32.2
 71.0       31.7          73.2      32.0          74.3      32.2          75.8      32.2
 71.5       31.8          73.3      32.0          74.6      32.2          76.0      32.1
 71.9       31.8          73.6      32.1          74.8      32.3          76.2      32.0
 72.0       31.8          73.7      32.1          75.0      32.3          76.9      30.7
 72.6       31.9          74.0      32.2          75.2      32.3          77.6      29.1
 72.8       32.0          74.1      32.2          75.3      32.3          78.4      27.3
 72.9       32.0


To simplify the calculations, 20 is subtracted from each of the functional values in making subsequent entries. The computations to determine the estimated values are then carried out as shown in detail in Table 14.15, just as for Table 14.8. In practical computation these entries, for the second approximation curves, would be made on the same sheet as were the entries in Table 14.8 for the first approximation curves, thus eliminating the work of entering the values of $X_2$, $X_3$, and $X_4$ over again.

Table 14.15

Computation of Functional Values from the Second Approximation Curves Corresponding to Independent Variables for Each Observation, and Computation of Estimated Value for $X_1$ and of New Residuals

Columns: (1) $X_2$; (2) $X_3$; (3) $X_4$ (independent variables); (4) $f_2''(X_2)$; (5) $f_3''(X_3)$; (6) $f_4''(X_4)$ (corresponding functional values*); (7) sum of columns (4) to (6); (8) $X_1'''$ = column (7) minus 3.4; (9) $X_1$ (dependent variable); (10) $z''' = X_1 - X_1'''$.

   (1)     (2)     (3)       (4)       (5)       (6)        (7)      (8)     (9)     (10)
     0     9.6    74.8       7.4      11.7      12.3       31.4     28.0    24.5    -3.5
     1    12.9    71.5       7.9      13.0      11.8       32.7     29.3    33.7     4.4
     2     9.9    74.2       8.4      12.2      12.2       32.8     29.4    27.9    -1.5
     3     8.7    74.3       8.8       9.9      12.2       30.9     27.5    27.5     0
     4     6.8    75.8       9.2       5.5      12.2       26.9     23.5    21.7    -1.8
     5    12.5    74.1       9.5      13.1      12.2       34.8     31.4    31.9     0.5
     6    13.0    74.1       9.8      13.0      12.2       35.0     31.6    36.8     5.2
     7    10.1    74.0      10.2      12.5      12.2       34.9     31.5    29.9    -1.6
     8    10.1    75.0      10.4      12.5      12.3       35.2     31.8    30.2    -1.6
     9    10.1    75.2      10.7      12.5      12.3       35.5     32.1    32.0    -0.1
    10    10.8    75.7      11.0      13.3      12.2       36.5     33.1    34.0     0.9
    11     7.8    78.4      11.2       7.8       7.3       26.3     22.9    19.4    -3.5
    12    16.2    72.6      11.4      12.2      11.9       35.5     32.1    36.0     3.9
    13    14.1    72.0      11.6      12.7      11.8       36.1     32.7    30.2    -2.5
    14    10.6    71.9      11.8      13.1      11.8       36.7     33.3    32.4    -0.9
    15    10.0    74.0      12.0      12.3      12.2       36.5     33.1    36.4     3.3
    16    11.5    73.7      12.1      13.3      12.1       37.5     34.1    36.9     2.8
    17    13.6    73.0      12.3      12.8      12.0       37.1     33.7    31.5    -2.2
    18    12.1    73.3      12.5      13.2      12.0       37.7     34.3    30.5    -3.8
    19    12.0    74.6      12.6      13.2      12.2       38.0     34.6    32.3    -2.3
    20     9.3    73.6      12.7      11.1      12.1       35.9     32.5    34.9     2.4
    21     7.7    76.2      13.0       7.5      12.0       32.5     29.1    30.1     1.0
    22    11.0    73.2      13.2      13.4      12.0       38.6     35.2    36.9     1.7
    23     6.9    77.6      13.3       5.7       9.1       28.1     24.7    26.8     2.1
    24     9.5    76.9      13.4      11.5      10.7       35.6     32.2    30.5    -1.7
    25    16.5    69.9      13.5      12.1      11.6       37.2     33.8    33.3    -0.5
    26     9.3    75.3      13.6      11.1      12.3       37.0     33.6    29.7    -3.9
    27     9.4    72.8      13.7      11.3      12.0       37.0     33.6    35.0     1.4
    28     8.7    76.2      13.7       9.9      12.0       35.6     32.2    29.9    -2.3
    29     9.5    76.0      13.6      11.5      12.1       37.2     33.8    35.2     1.4
    30    11.6    72.9      13.5      13.3      12.0       38.8     35.4    38.3     2.9
    31    12.1    76.9      13.4      13.2      10.7       37.3     33.9    35.2     1.3
    32     8.0    75.0      13.2       8.2      12.3       33.7     30.3    35.5     5.2
    33    10.7    74.8      13.0      13.2      12.3       38.5     35.1    36.7     1.6
    34    13.9    72.6      12.8      12.7      11.9       37.4     34.0    26.8    -7.2
    35    11.3    75.3      12.6      13.4      12.3       38.3     34.9    38.0     3.1
    36    11.6    74.1      12.4      13.3      12.2       37.9     34.5    31.7    -2.8
    37    10.4    71.0      12.2      12.9      11.7       36.8     33.4    32.6    -0.8
Totals                     447.6     445.1     448.7    1,341.4

* Less 20.0 for each functional reading.

Table 14.15 is worked out just as was Table 14.8. Thus the data for the first observation show values of 0, 9.6, and 74.8 for $X_2$, $X_3$, and $X_4$, respectively. Looking up the corresponding values in Tables 14.12, 14.13, and 14.14 gives values of 27.4, 31.7, and 32.3 for the three functional values. Subtracting 20 from each value, to reduce the subsequent clerical work, we enter 7.4, 11.7, and 12.3 in the functional columns. The three functional values are then added, and the sum entered in the seventh column. The entries for the functional readings are completed as shown, and the sum computed for each observation. Then the average of the seventh column is determined, giving the value 35.30. As the average of $X_1$ is 31.916, the value of the new constant, $a''_{1.234}$, is found by equation (14.5) to be

$$a''_{1.234} = 31.916 - 35.300 = -3.384$$

Accordingly, 3.4 is subtracted from each of the values in column 7 to give the estimated value of $X_1$, $X_1'''$, which is then entered in the eighth column.

The final step in computing the table is to subtract each of the estimated values, $X_1'''$, from the actual value $X_1$, giving the residuals $z'''$, which appear in the last column.

Comparing the new residuals, $z'''$, with the previous ones, $z''$, given in Table 14.8, we find that their size has been increased in just about as many cases as it has been decreased. But when we compute the standard deviation of the new residuals, we find that the adjusted standard deviation of $z'''$ is 3.3 bushels, or slightly smaller than the adjusted standard deviation of $z''$, 3.5 bushels, assuming that the new curves also use a total of eleven degrees of freedom.
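Subtracting 20 from every functional reading changes only the constant, not the estimates. The short check below uses the figures of Table 14.15 and the first observation; it is a sketch only, and the variable names are assumptions made for the illustration.

```python
n = 38
mean_x1 = 31.916
sum_reduced = 1341.4      # total of column (7), Table 14.15 (readings less 20)

# Constant by equation (14.5), with reduced readings and with full readings.
a_reduced = mean_x1 - sum_reduced / n
a_full = mean_x1 - (sum_reduced + 3 * 20 * n) / n   # three curves, 20 restored to each

# First observation: readings 27.4, 31.7, 32.3 from Tables 14.12-14.14.
reduced_sum = 7.4 + 11.7 + 12.3
full_sum = 27.4 + 31.7 + 32.3

print(round(a_reduced, 3))               # -3.384
print(round(a_full, 3))                  # -63.384
print(round(reduced_sum + a_reduced, 3)) # 28.016
print(round(full_sum + a_full, 3))       # 28.016, the same estimate either way
print(round(reduced_sum - 3.4, 1))       # 28.0, as entered in column (8)
```

The constant absorbs the 60 units removed from the three readings, which is why the clerical shortcut is harmless.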

Correcting the Curves by Further Successive Approximations. The process could be carried through one or more additional approximations by repeating the steps shown. Thus, the last residuals, $z'''$, when averaged and plotted with respect to the second set of approximation curves, would indicate whether any further modifications were needed in the curves; if any were needed, new readings would be made from the new curves, new estimates of $X_1$ obtained from them, and another set of residuals determined.

The number of successive approximations used in any given case would depend upon several considerations. In Chapter 10 it was noted that, with linear relations, repeated applications of the method of successive elimination would approach more and more closely to the net regression lines that would be obtained if a multiple linear regression equation were fitted to the same data by least squares. The successive elimination method could never improve upon the least squares norm. As long as the standard deviation of the residuals continued to decline, we would know that the successive steps were bringing us closer to the least squares fit.

The method of successive approximations would have a similar norm or limit for curvilinear relations only if we held the mathematical form of each net regression curve exactly the same during the entire process. With freehand methods we often change the shapes of the net regression curves somewhat from one step to the next, and we may sometimes make offsetting changes in the shapes of two net regression curves that leave the standard deviation of the residuals unchanged. In most practical situations, however, one can arrive fairly soon at an approximation which conforms to the hypothetical limitations placed upon the curves and also approaches the minimum standard deviation of residuals for the general types of curves chosen. As there is no definite criterion of best fit in freehand curvilinear regression, it may be advisable to average the curve readings from two or three successive approximations, after the standard error of estimate has approached a stable minimum value, to reduce somewhat the arbitrary elements in the final positions of the curves. This lack of definiteness has sometimes been regarded as a serious weakness of freehand methods. But it is equally possible to choose two or more mathematical functions that will yield almost identical standard errors of estimate when fitted to the same data by the method of least squares. The mathematical net regression curves will differ at least slightly from one function to another, and the selection of any one function as "best" will be arbitrary in much the same sense as will the selection of final curves in the freehand case.
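The remark that a fixed form of curve gives the process a definite limit can be illustrated mechanically. The sketch below is not the freehand method of the text: it swaps in group averages as a stand-in for each freehand curve and corrects the curves one at a time, each from the residuals left by the others. The data, the group labels, and all names are hypothetical, chosen only for the illustration.

```python
def group_means(keys, values):
    """Average of `values` within each group label (stand-in for a freehand curve)."""
    totals, counts = {}, {}
    for key, value in zip(keys, values):
        totals[key] = totals.get(key, 0.0) + value
        counts[key] = counts.get(key, 0) + 1
    return {key: totals[key] / counts[key] for key in totals}


def successive_approximations(y, factors, rounds):
    """Correct each curve in turn from the residuals left by the other curves."""
    n = len(y)
    mean_y = sum(y) / n
    curves = [{key: 0.0 for key in factor} for factor in factors]
    history = []                      # sum of squared residuals after each round
    for _ in range(rounds):
        for j, factor in enumerate(factors):
            # Dependent variable adjusted for every curve except curve j.
            partial = [
                y[i] - mean_y - sum(curves[k][factors[k][i]]
                                    for k in range(len(factors)) if k != j)
                for i in range(n)
            ]
            curves[j] = group_means(factor, partial)
        residuals = [
            y[i] - mean_y - sum(curves[k][factors[k][i]] for k in range(len(factors)))
            for i in range(n)
        ]
        history.append(sum(r * r for r in residuals))
    return curves, history


# Hypothetical yields with two grouped factors (not the corn data of the text).
y = [24.0, 30.0, 27.0, 35.0, 31.0, 29.0, 36.0, 33.0]
time_group = [0, 0, 0, 1, 1, 1, 2, 2]
rain_group = [0, 1, 0, 1, 1, 0, 1, 0]
curves, history = successive_approximations(y, [time_group, rain_group], rounds=4)
print([round(ss, 3) for ss in history])
```

Because the form of each "curve" is held fixed (one level per group), the sum of squared residuals cannot rise from one round to the next, which is the norm the text describes; with freehand curves whose shapes change between steps no such guarantee holds.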

Stating the Final Conclusions. After the final shape of the several net regression curves has been determined, either by graphic or algebraic processes, it still remains to state those curves in such a manner that their meaning is perfectly clear. The several functions may be stated to show the value of the dependent factor associated with given values of the particular independent factor when values of other independent factors are held at their means. There are two alternative ways of stating the associated values: (1) as actual values, and (2) as deviations from the mean values.

To state the associated values as actual values, we may use the following procedure for graphic curves.

First, the mean of all the values read from the final curve is determined. For $f_2(X_2)$, this mean may be designated $M_{f(X_2)}$. The values from the curve are read off for selected intervals of $X_2$. Then the estimated values of $X_1$ for each of these values of $X_2$ (with values of $X_3$, $X_4$, etc., at their means) are determined by subtracting the mean of the curve readings from each of these actual readings and adding to the result the mean of $X_1$. That is, if we use $X_1' = F_2(X_2)$ to designate these values of $X_1$, estimated from the net curvilinear relation to $X_2$, we can define them by the equation

$$X_1' = F_2(X_2) = f_2(X_2) - M_{f(X_2)} + M_1 \tag{14.7}$$

If, however, the expected values of $X_1$ for given values of $X_2$ are to be stated merely as deviations from the mean values, those deviations may be determined by subtracting from each curve reading the mean of all the curve readings. If we use $x_1'$ to designate these expected deviations from the mean values, we may define them by the equation

$$x_1' = F_2(x_2) = f_2(X_2) - M_{f(X_2)} \tag{14.8}$$

It is evident, from equations (14.7) and (14.8), that

$$F_2(X_2) = F_2(x_2) + M_1$$

In the actual statement of the results of a regression study, it is frequently desirable to state the relation of the dependent factor to the most important independent factor according to equation (14.7), and to state the relation for the remaining independent factors according to equation (14.8). When that is done, the estimated values of $X_1$, based on all the independent factors, may be readily computed by taking the estimate from the most important factor, and then adding to or subtracting from that the adjustments to take account of the departures of other factors from their means. Using $X_1'$ to designate this final estimate of the value $X_1$, and taking $X_2$ as the most important factor, we make the estimate by the equation

$$X_1' = F_2(X_2) + F_3(x_3) + F_4(x_4) + \cdots + F_n(x_n) \tag{14.9}$$

The process of working out these final statements of the net curvilinear regression lines may be illustrated by the data of the corn-yield problem. Since the rainfall ($X_3$) was apparently the most important factor, that may be taken as the one for which the regression is to be stated according to equation (14.7). If we regard the second approximation curve shown in Figure 14.8 and Table 14.13 as the final curve, then Table 14.15 gives the readings from this curve for each of the individual observations.

The mean of the readings of $f_3''(X_3)$ is next computed from the values of Table 14.15. The sum of the 38 $f_3''(X_3)$ readings is 445.1, so

$$M_{f(X_3)} = \frac{445.1}{38} = 11.71$$

The mean value of $X_1$ is $M_1 = 31.92$. From equation (14.7),

$$F_3(X_3) = f_3''(X_3) - M_{f(X_3)} + M_1$$

which is

$$F_3(X_3) = f_3''(X_3) - 11.71 + 31.92 = f_3''(X_3) + 20.21$$

All that is necessary, therefore, is to add the new constant, 20.2, to the values read from the curve (less 20, as entered in Table 14.15). This process is shown in Table 14.16.

Table 14.16

Computation of Average Yield of Corn With Varying Rainfall, Holding Trend in Yield and Influence of Temperature Constant

Inches of Rainfall,   Readings from Final Curve,*   Constant,               Average Yield,
      $X_3$                 $f_3''(X_3)$             $M_1 - M_{f(X_3)}$       $F_3(X_3)$
        7                       6.0                      20.2                  26.2
        8                       8.2                      20.2                  28.4
        9                      10.5                      20.2                  30.7
       10                      12.3                      20.2                  32.5
       11                      13.4                      20.2                  33.6
       12                      13.2                      20.2                  33.4
       13                      13.0                      20.2                  33.2
       14                      12.7                      20.2                  32.9
       15                      12.5                      20.2                  32.7
       16                      12.3                      20.2                  32.5

* Curve readings minus 20, just as entered in Table 14.15.
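The computation behind Table 14.16 follows directly from equation (14.7). The sketch below uses the reduced readings (curve readings minus 20) and the figures given in the text; the variable names are assumptions made for the illustration.

```python
m1 = 31.92                      # mean of X1, from the text
sum_of_readings = 445.1         # total of column (5), Table 14.15
n = 38

# Reduced readings of the final rainfall curve at selected rainfalls (Table 14.16).
readings = {7: 6.0, 8: 8.2, 9: 10.5, 10: 12.3, 11: 13.4,
            12: 13.2, 13: 13.0, 14: 12.7, 15: 12.5, 16: 12.3}

mean_reading = round(sum_of_readings / n, 2)    # 11.71
constant = round(m1 - mean_reading, 1)          # 20.2, as used in the table

# Equation (14.7): reading minus mean of readings plus mean of X1.
average_yield = {x3: round(r + constant, 1) for x3, r in readings.items()}

print(mean_reading, constant)
print(average_yield[7], average_yield[11], average_yield[16])   # 26.2 33.6 32.5
```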


The computation for $F_4(x_4)$ follows the same form as that for $F_3(X_3)$, save that equation (14.8) is used instead; hence the mean of $X_1$ is not involved. First the mean of all the readings for $f_4''(X_4)$, as shown in Table 14.15, is computed, giving the value of 11.81. The values for $F_4(x_4)$ are therefore given by the equation

$$F_4(x_4) = f_4''(X_4) - M_{f(X_4)} = f_4''(X_4) - 11.81$$

These values are worked out in Table 14.17.

Table 14.17

Computation of Deviation of Corn Yields from Yields Otherwise Expected, Because of Differences in Temperature for Season

Average Temperature,   Readings from Final Curve,*   Constant,          Correction to Expected Yield,
       $X_4$                 $f_4''(X_4)$             $-M_{f(X_4)}$            $F_4(x_4)$
       70.0                     11.6                    -11.8                    -0.2
       71.0                     11.7                    -11.8                    -0.1
       72.0                     11.8                    -11.8                     0
       73.0                     12.0                    -11.8                     0.2
       74.0                     12.2                    -11.8                     0.4
       75.0                     12.3                    -11.8                     0.5
       76.0                     12.1                    -11.8                     0.3
       77.0                     10.5                    -11.8                    -1.3
       78.0                      8.3                    -11.8                    -3.5

* Curve readings minus 20, just as entered in Table 14.15.

The net correction in the estimated yield to allow for the influence of trend can be obtained by carrying through a similar computation for $F_2(x_2)$. The readings for $f_2''(X_2)$ sum to 447.6, so $M_{f(X_2)} = 11.78$. The values of $F_2(x_2)$ are then given by the equation

$$F_2(x_2) = f_2''(X_2) - 11.78$$

This computation is carried out in Table 14.18.

The conclusions of the study can then be stated as shown in the last column of each of the last three tables, free from all the previous details.

Table 14.18

Computation of Deviation of Corn Yields from Those Otherwise Expected, Because of Net Trend in Yields

Number of Year,   Date   Readings from Final Curve,*   Constant,          Correction to Expected Yield,
     $X_2$                     $f_2''(X_2)$             $-M_{f(X_2)}$            $F_2(x_2)$
       0          1890             7.4                    -11.8                    -4.4
       5          1895             9.5                    -11.8                    -2.3
      10          1900            11.0                    -11.8                    -0.8
      15          1905            12.0                    -11.8                     0.2
      20          1910            12.7                    -11.8                     0.9
      25          1915            13.5                    -11.8                     1.7
      30          1920            13.5                    -11.8                     1.7
      35          1925            12.6                    -11.8                     0.8

* Curve readings minus 20.




The relations for each of the variables can also be combined to show the expected or estimated yield for various combinations of the independent factors. Thus for the present case, it might be desired to combine the findings into a table showing the expected or probable yield for any given combination of rainfall and temperature, with the 1927 trend of yield. These values can be obtained by taking the trend correction for 1927, plus 0.4 (the 12.2 under $f_2''(X_2)$ in Table 14.15 minus the constant 11.8 in Table 14.18), and combining it with the estimated influence of various quantities of rain and degrees of temperature. These estimates would then be defined by equation (14.9):

$$X_1' = F_2(x_2) + F_3(X_3) + F_4(x_4) = 0.4 + F_3(X_3) + F_4(x_4)$$

Combining the readings for $F_3(X_3)$ from Table 14.16 with those for $F_4(x_4)$ from Table 14.17, and adding in the correction for $X_2$ as just stated, we obtain estimated yields as shown in Table 14.19.


Table 14.19

Estimated Yield of Corn, in Bushels per Acre, With Varying Rainfall
and Temperature Conditions, for 1927

Inches of               Average Temperature†
Rainfall*      70°      72°      74°      76°      78°

 7             ‡        ‡        27.0     26.9     23.1
 9             30.9     31.1     31.5     31.4     ‡
11             33.8     34.0     34.4     34.3     ‡
13             33.4     33.6     34.0     ‡        ‡
15             32.9     33.1     ‡        ‡        ‡

* Total for June, July, and August, average for nine Corn Belt stations.
† Average for June, July, and August at same nine stations.
‡ This combination of factors was not represented in the observations analyzed.
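
Each cell of Table 14.19 is the sum of a rainfall-and-trend level and a temperature correction. The sketch below rebuilds the printed cells in Python under one loudly stated assumption: the row levels are back-solved from the printed cells (cell minus temperature correction), because Table 14.16 is not reproduced here. All names are illustrative.

```python
# Temperature corrections F4(X4) from Table 14.17, bushels per acre.
temp_corr = {70: -0.2, 72: 0.0, 74: 0.4, 76: 0.3, 78: -3.5}

# ASSUMPTION: rainfall level plus the 0.4 trend correction, back-solved from
# the printed cells of Table 14.19; not taken from Table 14.16.
rain_level = {7: 26.6, 9: 31.1, 11: 34.0, 13: 33.6, 15: 33.1}

# Combinations entered in the printed table; all others are marked as not
# represented in the observations.
observed = {7: [74, 76, 78], 9: [70, 72, 74, 76], 11: [70, 72, 74, 76],
            13: [70, 72, 74], 15: [70, 72]}

# Estimated yield = rainfall-and-trend level + temperature correction.
table = {(rain, t): round(rain_level[rain] + temp_corr[t], 1)
         for rain, temps in observed.items() for t in temps}

for (rain, t), est in sorted(table.items()):
    print(f"rainfall {rain:2d} in., temperature {t} deg: {est:.1f} bu.")
```

The dictionary holds 16 entries, matching the 16 combinations the text refers to; unobserved combinations are deliberately left out rather than extrapolated.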

In preparing a table such as Table 14.19, we should not enter values for
combinations of the several factors which were not represented in the
data on which the relations were based. Examination of a dot chart of
the relation between rainfall and temperature, for the data included in
the analysis, shows that no combinations of rainfall below 9 inches and
temperature below 74 degrees appeared in the record, and no cases of
temperature above 78 degrees with rainfall above 9 inches occurred.
Accordingly, these combinations, and other combinations which were
not represented, are left blank in the table, as shown. (A more exact
method for measuring the representativeness of the relations is referred
to in Chapter 19, on page 324.)

⁹ Table 14.19 may be compared with the results secured by cross-classifying and
averaging the same data, by the methods of Chapter 23.

By combining a table such as Table 14.19 with a statement of the
extent to which yields averaged higher or lower than those shown at
different times through the period, all the conclusions from the study can
be presented in simple form, easy to understand.




[Figure 14.10 not reproduced. Surviving labels: vertical axis from -4 to +4,
"Allowance for temperature"; horizontal axis "Temperature (degrees)", 70 to 80.]

Fig. 14.10. Relation of yield of corn to rainfall, temperature, and time.

The final results of curvilinear correlation studies, after being simplified
to the form shown in Tables 14.16 to 14.18, or in Table 14.19, may also
be expressed graphically for final publication. Thus all three relations
might be combined into a single figure, such as Figure 14.10, to present
in relatively simple form the final conclusions reached by the statistical
analysis.¹⁰

¹⁰ A three-dimensional chart illustrating Table 14.19 is shown on page 349.






It might be noted at this point that Table 14.19 is much more than
merely a table of average yields for various rainfall and temperature
groups. There were only 38 observations to begin with, and only 14
of those were under 74 degrees temperature. If these 14 observations had
been grouped according to year and rainfall, and the average yield
determined for each class, only the roughest sort of groups could have been
made, and even then the averages would have had but little reliability.
As the result of the correlation study, however, all 38 observations have
been drawn on to determine the relations. The table shows the yield most
likely to be received with any of 16 different combinations of rainfall
and temperature, for the trend in 1927. Other estimates could be shown
for a large number of other combinations. Furthermore, it is known that
estimates made from such tables agreed with the actual yields to within
2.8 bushels in about two-thirds of the original cases. The reliability of
these estimated yields is thus greater than it would be for any average of
a few cases alone. This example illustrates the ability of regression analysis
both to bring out of a series of observations relations which are not
observable on the surface, and to provide a basis for estimating the
probable effect on the dependent factor of new combinations of the
independent factors.

It should be noted that none of the three net regression curves in this
example could be approximated at all satisfactorily by straight lines.
The constants of the multiple linear regression equation on page 215
show only whether the net slope of each relationship was preponderantly
positive or negative. It would be quite possible to obtain a linear partial
regression coefficient of zero if the true curvilinear relationship were a
second-degree parabola with its maximum near the mean of the observed
data.
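
The last point can be checked numerically. The Python sketch below uses invented data, not the corn figures: the values of X are symmetric about their mean and Y is a parabola with its maximum at that mean. A straight-line fit then has a slope of essentially zero, while a second-degree fit recovers the curve.

```python
import numpy as np

# Hypothetical data: X symmetric about its mean, Y a parabola peaking there.
x = np.arange(-5.0, 6.0)          # -5, -4, ..., 5 (mean 0)
y = 30.0 - 0.4 * x**2             # maximum of 30 at x = 0

slope, intercept = np.polyfit(x, y, 1)   # straight-line fit
quad = np.polyfit(x, y, 2)               # second-degree fit, highest power first

print(f"linear slope: {slope:.6f}")
print("parabola coefficients:", np.round(quad, 6))
```

A linear coefficient near zero therefore says nothing about whether a strong curvilinear relation is present.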

In the present example, the final net regression curves are not drastically
different from the simple regression curves based on the original lines of
averages (Figures 14.3, 14.4, and 14.5). The regression of yield on
temperature showed the most substantial alteration, but judging from the scatter
of residuals about the final curve (Figure 14.11), it was the least reliably
determined from a sampling standpoint. In other examples more striking
changes might be found, similar to the differences noted between simple
and partial linear regressions in Chapters 10 through 13.

Limitations on the Use of the Results. The results of the corn-yield
analysis apply only to the same area from which the data were drawn
and to the period which they covered. Thus they provide no basis for
estimating corn yields in other sections, and their use in estimating yields
in other periods, as in subsequent years, is attended by increasing risk,
due especially to the necessity of extrapolating the trend regression.
Although this may give fair results for a year or two, it may tend to become
increasingly inexact. For example, it may be that the trend of yield did
not really turn downward about 1920, but only flattened out; additional
years of observations are needed to tell which is correct.



Fig. 14.11. Comparison of apparent relation of corn yields to temperature with
net relation after eliminating influence of rainfall and of trend in yield.

Sufficient time has elapsed since the publication of the first edition, 
which included the preceding paragraph, to afford a check on the stability 
of the corn-yield relationships over the years. From the standpoint of 
stability, these relationships are vulnerable to the peculiar hazards 
attending time series which are subject in part to human control. These 
hazards will be further discussed in Chapter 20. They are also subject 
to sampling variability, which has been ignored in our treatment of 
multiple regression analysis up to this point but which will be discussed
in Chapters 17 through 19. 

Table 14.20 presents a continuation for 1928-1956 of the same series
given for earlier years in Table 14.1. The time variable has, however,
been renumbered, starting with 1928 as the first year. The yields as
estimated from the 1890-1927 relationships were obtained as in Table
14.15; the trend, $X_2$, was extrapolated assuming a continued decline of
0.2 bushel a year, as indicated in Figure 14.10 for the period 1920-1927.

The basic data in Table 14.20 may be used as an exercise, starting with
a simple plotting of rainfall, temperature, and yield as in Figure 14.2.
It will be noted from such a figure that the average levels of the rainfall
and temperature series are approximately the same as in the earlier period
and that, as before, there is a fairly high negative correlation between
rainfall and temperature. These two variables may apparently be regarded
as “sampled” from the same statistical universe in both periods. However,
from about 1939 on corn yields were substantially and continuously
higher than would have been expected on the basis of 1890-1927 conditions.
It is also clear that, except for temperature effects in 1934 and 1936,
“time” accounts for a much greater proportion of the 1928-1956 variation


Table 14.20

Yield of Corn, Rainfall, and Temperature in Six Leading States,
1928-1956

Year   Time,   Rainfall,   Temperature,   Yield,    Yield Estimated    Difference,
       $X_2$   $X_3$       $X_4$          $X_1$     from 1890-1927     $X_1 - X_1'$
                                                    Relationships,
                                                    $X_1'$
               Inches      Degrees        Bushels   Bushels            Bushels

1928    1      15.1        72.8           33.4      33.1                0.3
1929    2      10.6        73.4           31.5      33.7               -2.2
1930    3       6.4        76.4           25.8      24.3                1.5
1931    4      10.4        76.9           32.7      31.7                1.0
1932    5      13.5        76.0           35.4      32.7                2.7
1933    6       7.2        77.3           29.4      24.0                5.4
1934    7       7.5        80.0           18.9      17.4                1.5
1935    8       9.6        76.2           31.7      30.6                1.1
1936    9       4.9        80.0           18.5      10.6                7.9
1937   10      10.1        76.6           36.4      30.6                5.8
1938   11      12.9        76.2           36.5      31.3                5.2
1939   12      12.4        75.6           41.3      31.9                9.4
1940   13       9.1        75.4           38.1      28.9                9.2
1941   14       9.8        75.8           42.8      30.0               12.8
1942   15      13.4        74.4           49.1      30.9               18.2
1943   16      11.9        76.6           44.1      30.1               14.0
1944   17      11.0        75.5           42.2      31.2               11.0
1945   18      10.2        [illegible]    41.2      29.6               11.6
1946   19      12.3        [illegible]    47.8      30.4               17.4
1947   20      12.2        75.2           32.2      30.3                1.9
1948   21      11.9        74.1           53.8      30.0               23.8
1949   22      12.8        76.2           45.8      29.1               16.7
1950   23      14.3        71.4           46.5      28.6               17.9
1951   24      14.9        72.9           43.7      28.5               15.2
1952   25      10.8        77.1           52.5      27.5               25.0
1953   26       7.7        77.2           47.0      21.5               25.5
1954   27      12.2        77.4           46.2      26.3               19.9
1955   28       9.4        76.3           46.6      25.8               20.8
1956   29      11.5        75.5           52.9      28.7               24.2

Source: Computed from June, July, and August records for nine weather stations in
Corn Belt states. Stations averaged include Kansas City, St. Louis, Toledo, Omaha,
Peoria, Cincinnati, Topeka, Indianapolis, and the Iowa state average, as in the original
study.



in corn yields than do rainfall and temperature combined. This is clearly
indicated by the strong trend in the residuals in Table 14.20; these residuals
evidently result from factors other than temperature and rainfall.

In view of this circumstance it seems desirable in analyzing the
new observations to start out with the relation between corn yield and
time. We have knowledge in addition to the yield series itself to provide
some support in this analysis.

First, we know that hybrid seed corn became commercially important
beginning in the middle 1930's. Controlled experiments in the various
Corn Belt states have indicated that hybrid corn yields at least 20 per cent
more bushels per acre than the open-pollinated varieties used during
1890-1927. From 1934 to date we have annual estimates of the acreage
of corn planted with hybrid seed in the leading states (see Table 14.21).
Starting from 1 per cent of total corn acreage in 1934, hybrids were planted
on 94 per cent of the acreage in 1945. By the latter year, the effect of
hybrid seed on corn yields in the 6 states must have been approximately
6 bushels per acre (94 per cent times 20 per cent times 33.2 bushels, the
average yield during 1923-1927).
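
The arithmetic behind the 6-bushel figure can be written out in Python as a check, using only the three numbers quoted in the paragraph above; the variable names are illustrative.

```python
hybrid_share = 0.94   # share of corn acreage in hybrids by 1945 (from the text)
hybrid_gain = 0.20    # minimum yield advantage of hybrid seed (from the text)
base_yield = 33.2     # average yield, 1923-1927, bushels per acre (from the text)

# Effect on the 6-state average yield: only the hybrid acreage gains, and it
# gains a fixed fraction of the old open-pollinated yield.
effect = hybrid_share * hybrid_gain * base_yield
print(f"{effect:.1f} bushels per acre")   # 6.2 bushels per acre
```

Because 20 per cent is described as a minimum advantage, the result is best read as a lower bound on the hybrid effect.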


Table 14.21

Use of Hybrid Seed Corn and Commercial Fertilizer,
Selected Years, 1930-1955

Year    Percentage of Corn       Commercial Fertilizers Used in the
        Acreage Planted with     East North Central and West North
        Hybrid Seed, 6 States    Central Regions

        Per cent                 Plant nutrients, thousands of tons

1930    (no entry)                 185
1935     2.6                       162
1940    60.8                       252
1945    94.5                       598
1950    98.3                     1,255
1955    99.2                     2,246


Second, starting about 1940 the use of commercial fertilizer on corn,
previously unimportant in the 6 states, increased very rapidly. Figures
for an area which includes the 6 states are shown in Table 14.21; the
rate of change in fertilizer use in the 6 states would be very similar.
Experimental studies of the response of corn yields to fertilizer do not
give us as accurate a figure as we have for the effects of hybrid seed.
However, scattered studies suggest that the yield increase in the 6 states
resulting from greater use of fertilizer was on the order of 10 bushels
per acre between the 1930's and the mid-1950's.

Third, we have the analysis of the effects of rainfall and temperature
upon corn yields presented earlier in the chapter. This gives us a basis
for adjusting trend yields, particularly in the drouth years 1934 and 1936,
for the probable effects of rainfall and temperature, thereby refining our
initial approximation to the appropriate freehand trend.

The averages of corn yields, rainfall, and temperature for successive
periods are given in Table 14.22. If we examine the net regression curves


Table 14.22

Averages of Corn Yields, Rainfall, and Temperature,
Selected Periods, 1928-1956

Period       Number     Average    Average     Average
             of Cases   Yield      Rainfall    Temperature
                        of Corn

1928-1932    5          31.76      11.2        75.1
1933-1938    6          28.57       8.7        77.7
1939-1944    6          42.93      11.3        75.6
1945-1950    6          44.55      12.3        74.0
1951-1956    6          48.15      11.1        76.1


in Figure 14.10, it appears that the averages of rainfall and temperature for
all periods other than 1933-1938 would normally be associated with corn
yields varying over a range of less than a bushel. In other words, these
averages fall on portions of the net regression curves that are nearly
horizontal. But during 1933-1938 both rainfall and temperature were
strongly adverse to good corn yields; their combined effect would be to
reduce yields about 7 bushels relative to those for the other periods,
judging from the net regression curves for the earlier period. Furthermore,
neither hybrid seed nor commercial fertilizer could have had a great effect
upon the average yield during 1933-1938; it seems probable that the
earlier relationships were still applicable during that time.
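
The period averages of yield and rainfall in Table 14.22 can be re-derived from the yearly figures of Table 14.20. The Python sketch below does so; the lists are the yearly values as read from Table 14.20, and the helper name `period_mean` is illustrative. Temperature is left out of the sketch.

```python
# Yield (bushels) and rainfall (inches) by year, 1928-1956, from Table 14.20.
yields = [33.4, 31.5, 25.8, 32.7, 35.4, 29.4, 18.9, 31.7, 18.5, 36.4, 36.5,
          41.3, 38.1, 42.8, 49.1, 44.1, 42.2, 41.2, 47.8, 32.2, 53.8, 45.8,
          46.5, 43.7, 52.5, 47.0, 46.2, 46.6, 52.9]
rainfall = [15.1, 10.6, 6.4, 10.4, 13.5, 7.2, 7.5, 9.6, 4.9, 10.1, 12.9,
            12.4, 9.1, 9.8, 13.4, 11.9, 11.0, 10.2, 12.3, 12.2, 11.9, 12.8,
            14.3, 14.9, 10.8, 7.7, 12.2, 9.4, 11.5]
periods = [(1928, 1932), (1933, 1938), (1939, 1944), (1945, 1950), (1951, 1956)]

def period_mean(values, start, end, first_year=1928):
    """Average of the yearly values from `start` through `end`, inclusive."""
    chunk = values[start - first_year:end - first_year + 1]
    return sum(chunk) / len(chunk)

for start, end in periods:
    print(f"{start}-{end}: yield {period_mean(yields, start, end):.2f}, "
          f"rainfall {period_mean(rainfall, start, end):.1f}")
```

The low 1933-1938 averages for both yield and rainfall stand out at once, which is the reason that period alone is adjusted before the trend is drawn.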

The average yield for 1933-1938 may therefore be raised about 7 bushels
above the actual average before drawing in a first approximation to the
trend. It should be noted that this adjustment could not have been made
without the knowledge gained from the 1890-1927 analysis, or related
knowledge based on test-plot records from experimental farms in the
6 states over considerable periods.

With these preliminaries, the rest of the analysis may be left as an
exercise for the student. Although the major trend effects have been
allowed for, it is possible that hybrid seed, commercial fertilizer, and other
cultural practices may also have affected the net relations between yield,
temperature, and rainfall. On the graphic level, the student may investigate
this possibility for himself, by making a separate analysis for the period
1928-1956, perhaps starting with a first-approximation trend and then
determining the other two regressions and the final net trend, by successive
approximations. The resulting net curves could then be compared with
those for 1890-1927.

Another possibility is to assume that the yield of corn is influenced
by weather conditions as a per cent of the normal yield from improving
technology. This could be explored by fitting an analysis to the entire
period using the equation

$$\log X_1 = a + f_2(X_2) + f_3(X_3) + f_4(X_4)$$

(with the symbols for the variables having the same meaning as earlier
in this chapter) and then relating the values of $f_2(X_2) + a$ for each
observation to the changing use of fertilizer and of hybrid seed (so far
as data were available). This last step would show how much of the net
trend could be explained by these two variables and correlated ones,
and how much reflected other independent improvements in technology.

Reliability and Use of Regression Curves 

The regression curves show the net relation between the dependent
variable and each independent variable, with the net variation associated
with the other independent variables held constant, for the particular
observations included in the sample. If another sample were drawn
from the same universe, and similar net regression curves were determined,
they would vary somewhat from the curves determined from the first
sample. (It should be noted that the 1890-1927 relationships gave good
estimates of corn yields for some 8 years after the end of that period,
suggesting that the 1928-1935 observations were drawn from essentially
the same universe, before the effects of hybrid seed and commercial
fertilizer became important.) The lower the multiple correlation in the
universe, or the smaller the sample, the larger would be this variation
between successive samples. Methods have been developed for estimating
the proportion of such samples which will give regression results falling
within given ranges of the true regressions prevailing in the universe.
(See Chapter 17, pages 290-293.) In publishing regression results, as
shown in Tables 14.16 to 14.19, or in presenting charts of the regression
results, such as Figure 14.10, the reliability range of the regressions should
be indicated, as shown subsequently. Even if the regressions (as in the
example here) are determined from a time series, and so are based upon
all the evidence for that portion of the constantly evolving universe, the
reliability limits may still be used as a preliminary indication of possible
significance, in view of the closeness with which the relations can be
determined. (For a more extended discussion of the meaning of sampling
errors with respect to time, see Chapter 20.)

Where net regression curves are fitted as algebraic equations of the
types illustrated in this chapter [equations (14.1) to (14.3)], or to other
equations with linear coefficients, the coefficients for the equations can
be computed by electronic calculators just as readily as they can for linear
equations, as discussed in Chapter 11. The machines can perform the
necessary calculations for the extensions and solve the equations so fast,
once a method of “programming” the successive operations has been
worked out, that it is possible to try a variety of alternative types of
equations for the curves and to see empirically which ones give the best
fit. It is also possible to compute the standard error of the coefficients
for each term in the equations at the same time, and then, by dropping
out the terms which show coefficients of nonsignificant size compared with
their standard errors and solving the equations with these terms omitted,
to fit algebraic functions all of whose terms show significant values.
(This may, however, affect the meaning of the calculated standard errors;
see Chapter 17, page 295.)
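
In present-day terms, trying alternative equations and computing the standard error of each coefficient might be sketched as follows in Python. The data are invented for illustration, the function name `fit_with_se` is hypothetical, and the standard errors are the usual least-squares ones on n - m degrees of freedom; none of this is taken from the book's own worked examples.

```python
import numpy as np

def fit_with_se(A, y):
    """Least-squares coefficients, their standard errors, and the standard
    error of estimate, for a design matrix A that includes a constant."""
    n, m = A.shape
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    s2 = resid @ resid / (n - m)                  # residual variance, n - m d.f.
    se = np.sqrt(s2 * np.diag(np.linalg.inv(A.T @ A)))
    return coef, se, float(np.sqrt(s2))

# Hypothetical data: a parabolic net relation plus random scatter.
rng = np.random.default_rng(0)
x = rng.uniform(5.0, 15.0, size=60)
y = 34.0 - 0.3 * (x - 11.0) ** 2 + rng.normal(0.0, 1.0, size=60)

xc = x - x.mean()                 # centering keeps the powers less intercorrelated
one = np.ones_like(xc)
candidates = {
    "linear":   np.column_stack([one, xc]),
    "parabola": np.column_stack([one, xc, xc**2]),
    "cubic":    np.column_stack([one, xc, xc**2, xc**3]),
}

results = {}
for name, A in candidates.items():
    coef, se, s_est = fit_with_se(A, y)
    results[name] = (coef, se, s_est)
    print(f"{name:9s} S = {s_est:.2f}  coef/se = {np.round(coef / se, 1)}")
```

Comparing the standard errors of estimate across the candidate equations shows empirically which form fits best, and the ratio of each coefficient to its standard error indicates which terms are candidates for omission.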

There are many problems, however, where initial exploration by the
successive approximation method is very valuable for deciding what the
net regression curves present are really like, even if subsequently they are
fitted by algebraic equations suitable to represent those types. Also,
even where there is some specific hypothesis as to the type of curve or
curves that should be present, it may be useful to examine the data,
unconditioned by any specific type of assumed equation, to see if the
apparent net relations do conform at all to those logically expected, or
if the facts are strongly in contrast with the expectations. With problems
involving a relatively small number of observations, this exploratory fitting
can readily be done by the short-cut method described in Chapter 16.
Where the samples are large, however, or where the number of variables
which needs to be considered is large (say five or more), the longer but
more exact successive approximation method of Chapter 14 may yield
better results.




Use of Electronic Calculating Machines in Computing 
Multiple Curvilinear Regressions 

For such cases, either card-tabulation machines equipped with
automatic multipliers, or electronic calculators, could be used to greatly
accelerate the successive approximation process. This would involve
the following steps.

1. Computation of the extensions, and then (for electronic calculators),
solution of the linear equations.

2. Computation of the values of $X_1'$ and $z$ for each observation, and
entering them on the tape or card for each.

3. Sorting the observations into groups for each independent variable
in turn, and computing the average values of the independent variable and
$z$ for each of these groups.

4. With these values computed, charts such as Figures 14.3, 14.4, and
14.5 would be made by hand, and first-approximation curves fitted to
each, if indicated. With card tabulators, the values of the several
first-approximation functions could then be calculated by hand, and entered
on the cards; with electronic calculators and large samples, the machines
could be “instructed” concerning each net regression first-approximation
function, and they could then calculate the values of the several functions
corresponding to the value of each independent observation for each
variable. In either case, the card tabulator or electronic machines could
then sum them to obtain $X_1''$, subtract from $X_1$ to obtain $z''$, and
calculate the $s_{z''}$.

5. With these new values of $z''$, the entire process could be repeated
as under 4, and carried through as many successive approximations as
was found necessary to achieve stability in all the curves and reduce
$\bar{S}$ to a minimum consistent with logical restrictions on the shape of the
regression curves. At each step the $s_z$ would be calculated by the machines,
and adjusted by hand to $\bar{S}$ to determine the progress made, if any, in
getting a better fitting set of net regression curves.
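
The five steps above can be sketched in Python. This is a mechanized stand-in under stated assumptions, not the authors' procedure: the freehand curve is replaced by group averages joined by straight lines, the curves are revised one variable at a time, and the data are invented. The names `curve_from_group_averages` and `successive_approximation` are hypothetical.

```python
import numpy as np

def curve_from_group_averages(x, target, n_groups=8):
    """Stand-in for a freehand curve: average `target` within groups of
    observations sorted by x, then interpolate between the group averages."""
    order = np.argsort(x)
    groups = np.array_split(order, n_groups)
    gx = np.array([x[g].mean() for g in groups])
    gy = np.array([target[g].mean() for g in groups])
    return np.interp(x, gx, gy)          # curve reading for every observation

def successive_approximation(X, y, n_rounds=5, n_groups=8):
    n, k = X.shape
    # Step 1: linear net regressions by least squares.
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    a = y.mean()
    curves = (X - X.mean(axis=0)) * coef[1:]       # centered linear readings
    # Step 2: residuals z from the linear estimates, and their s.d.
    s_z = [float(np.std(y - a - curves.sum(axis=1)))]
    for _ in range(n_rounds):                      # Steps 3 to 5, repeated
        for j in range(k):
            # Residuals taken as departures from the current curve for X_j.
            partial = y - a - curves.sum(axis=1) + curves[:, j]
            f = curve_from_group_averages(X[:, j], partial, n_groups)
            curves[:, j] = f - f.mean()            # keep each curve centered
        s_z.append(float(np.std(y - a - curves.sum(axis=1))))
    return a, curves, s_z

# Hypothetical data: one parabolic and one linear net relation.
rng = np.random.default_rng(0)
n = 200
X = np.column_stack([rng.uniform(5, 15, n), rng.uniform(70, 80, n)])
y = 34 - 0.3 * (X[:, 0] - 10) ** 2 + 0.2 * (X[:, 1] - 75) + rng.normal(0, 1, n)

a, curves, s_z = successive_approximation(X, y)
print("s_z by round:", [round(s, 2) for s in s_z])
```

The list `s_z` plays the part of the progress check in step 5: its first entry belongs to the linear regressions and later entries to each round of revised curves. No adjustment for n - m is applied in this sketch, and group averages follow the data without the logical restrictions on curve shape that the investigator would impose by hand.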

By this machine-aided process, the successive approximation operation
could be carried out quite rapidly (assuming the machines to be available
to work with the investigator as soon as each hand operation was
completed), so that in a relatively short time a set of graphic curves
could be fitted to a relatively large sample for a number of independent
variables, with the investigator having as close and intimate a touch with
the process, and as much awareness of the kind of curves being obtained,
as in the full hand-operated process described in this chapter.

An alternative method would be to take a random sample of, say,
40 observations from the large sample and determine the approximate
shape of the net regression curves from it. Two or three alternative
algebraic expressions of these shapes could then be fitted by machine
processes to the full set of observations, and the best fit selected for the
final expression.

Summary 

In this chapter methods of determining curvilinear multiple regressions
have been discussed. These show the extent to which changes in the
dependent variable are associated with changes in each particular
independent variable, while simultaneously removing that part of the variation
in the dependent variable which is associated (linearly or curvilinearly)
with other independent variables. A method of determining the curves
by successive graphic approximations is presented step by step. Since
this method does not involve making definite assumptions as to the final
shape of the curves, it is to be preferred, at least for exploratory studies,
to more mathematical methods, presented earlier in this chapter, unless
there is a logical basis for the choice of specific functions. Methods of
simplifying the conclusions for popular statement are illustrated, and
the universe to which they are applicable is briefly considered.

REFERENCE 

Ezekiel, Mordecai, A method of handling curvilinear correlation for any number of
variables, Quart. Pub. Amer. Stat. Assoc., Vol. XIX, No. 148, pp. 431-453,
December, 1924.

Note 14.1. In applying the process described in the middle paragraph on page 246,
terms (or independent variables) whose values are insignificant compared to their
standard errors should be dropped out one at a time, starting with the one which shows
the least significance, and then the whole set of equations solved again with that one
omitted (note pages 520-522). This is necessary because two terms (or independent
variables) which are highly intercorrelated may show no significance when both are
included, yet if one of them is dropped out the remaining one may then show a significant
regression. Note also discussion of the underlying logic of this situation beginning in
the last paragraph on page 473.
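
The one-at-a-time rule of Note 14.1 can be sketched in Python. The data are invented so that two variables are almost copies of each other, and the cut-off of 2.0 for the ratio of a coefficient to its standard error is an assumption made for illustration, not a figure from the text. The function names are hypothetical.

```python
import numpy as np

def t_ratios(A, y):
    """Ratio of each least-squares coefficient to its standard error."""
    n, m = A.shape
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    s2 = resid @ resid / (n - m)
    se = np.sqrt(s2 * np.diag(np.linalg.inv(A.T @ A)))
    return coef / se

def drop_one_at_a_time(X, y, names, threshold=2.0):
    """Drop the least significant term, solve again, and repeat until every
    remaining term has |coefficient / standard error| >= threshold."""
    keep = list(range(X.shape[1]))
    while keep:
        A = np.column_stack([np.ones(len(y)), X[:, keep]])
        t = np.abs(t_ratios(A, y)[1:])        # skip the constant term
        weakest = int(np.argmin(t))
        if t[weakest] >= threshold:
            break
        keep.pop(weakest)                     # remove ONE term, then refit
    return [names[j] for j in keep]

# Hypothetical data: x3 is nearly a copy of x2; x4 is unrelated to y.
rng = np.random.default_rng(1)
x2 = rng.normal(size=40)
x3 = x2 + rng.normal(scale=0.05, size=40)
x4 = rng.normal(size=40)
y = 2.0 * x2 + rng.normal(scale=1.0, size=40)
X = np.column_stack([x2, x3, x4])

full = np.column_stack([np.ones(40), X])
print("ratios with all terms in:", np.round(t_ratios(full, y), 1))
kept = drop_one_at_a_time(X, y, ["x2", "x3", "x4"])
print("terms retained:", kept)
```

Dropping both intercorrelated terms at once, on the strength of their individual ratios in the full equation, would discard a relation that the one-at-a-time procedure retains through whichever of the pair is left.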



CHAPTER 15


Measuring accuracy of estimate 
and degree of correlation for 
curvilinear multiple regressions 


In presenting linear multiple regression methods it was observed that
coefficients could be computed to show (1) how closely estimated values
of the dependent variable, based on the linear regression equation,
could be expected to agree with the actual values; and (2) what proportion
of the total observed variance in the dependent factor could be explained
or accounted for by its relation to the independent factors considered.
These coefficients were, respectively, the standard error of estimate and
the coefficient of multiple correlation. Exactly parallel coefficients can
be computed to show the importance of the relationship for curvilinear
multiple regression, employing curvilinear net regressions such as those
discussed in Chapter 14. The term standard error of estimate is again
used to indicate the measure of the probable accuracy of estimated values
of the dependent factor. In measuring the proportion of variance explained
we will follow the usage in simple curvilinear regression, and use the
term index to denote the fact that curvilinear regressions have been
employed. The proportion of variance accounted for is therefore shown
by the index of multiple determination, which is the square of the index
of multiple correlation.

Standard Error of Estimate. Values of $X_1$ may be estimated from
$X_2, X_3, \ldots, X_k$ by a multiple curvilinear regression equation of the type

$$X_1' = a + f_2(X_2) + f_3(X_3) + \cdots + f_k(X_k) \qquad (15.1)$$

where the net regression functions $f$ specify curves fitted simultaneously
either as mathematical equations or as graphic curves determined by a
successive approximation process. In either case, when estimated values,
$X_1'$, are determined from equation (15.1), for the observations included
in the sample, and the residuals $z$ are determined,

$$z_{1.f(2,3,\ldots,k)} = X_1 - X_1' \qquad (15.2)$$

the standard error of estimate is then defined as the $s_z$, so that

$$S_{1.f(2,3,\ldots,k)} = s_z \qquad (15.3)$$

The standard error of estimate may be used to indicate the closeness
with which new values of the dependent variable drawn from the same
universe may be expected to agree with corresponding estimated values
based upon the observed relationship with the independent variables.
Where the net regression curves are determined as algebraic equations,
as in the three-variable exercise presented on pages 206 to 208, it is not
necessary to go through the actual process of working out the $X_1'$ and $z$
values for each observation. Instead, the standard error can be computed
directly from the values of $s_1$ and of $R$ calculated by the usual equations.
In the example mentioned, the regression equation used was

$$X_1 = a + b_2 X_2 + b_2' X_2^2 + b_3 X_3 + b_3' X_3^2 \qquad (15.4)$$

and the values computed were $s_1 = 1.1$ and $R_{1.22'33'} = 0.968$. The
standard error of estimate for such mathematically determined curves
may then be calculated in the usual way by the equation

$$S^2_{1.f(2,3,\ldots,k)} = s_1^2 \, (1 - R^2_{1.22'33'\ldots kk'}) \qquad (15.5)$$

which for this case gives

$$S^2_{1.f(2,3)} = (1.1)^2 (1 - 0.968^2) = 0.0762, \qquad S_{1.f(2,3)} = 0.28$$

Similarly, by equation (12.1),

$$\bar{S}^2_{1.f(2,3,\ldots,k)} = S^2_{1.f(2,3,\ldots,k)} \, \frac{n}{n-m} \qquad (15.6)$$

For this example,

$$\bar{S}^2_{1.f(2,3)} = (0.0762) \, \frac{n}{n-m} = 0.1016, \qquad \bar{S}_{1.f(2,3)} = 0.32$$

Where graphic curves are used for the net regressions, $s_z$ has to be
computed by carrying out the operations of working out $X_1'$ and $z$ for
each observation. Thus in the problem worked through in Chapter 16
for steel costs, the standard deviation of the final $z$, adjusted by equation
(15.6), is \$3.86 per ton. This is therefore the standard error of estimate
in calculating steel costs for additional observations from the fitted net
regression curves. This compares with the unadjusted $S$ of \$2.88. Because
of the larger size of $m$, the adjustment for $n - m$ is even more important
for curvilinear regressions than for linear ones.




Index of Multiple Correlation. This coefficient has a meaning exactly
corresponding to that of the coefficient of multiple correlation, but
applies to curvilinear net regressions instead of linear. Both measure the
simple correlation of the original values of $X_1$ (for the cases in the sample)
with the estimated values $X_1'$, calculated according to a regression equation
including two or more independent variables. The coefficient of multiple
correlation $R_{1.23\ldots k}$ is used when the estimates are based on linear net
regressions for each of the independent variables $X_2, X_3, \ldots, X_k$. The
index is used when the estimates are based on curvilinear net regressions
for one or more of the independent variables, and is designated $I_{1.23\ldots k}$.

If the curvilinear net regressions are determined as algebraic equations,
as in the first example mentioned above, the index of multiple correlation
$I_{1.23}$ is the same as the coefficient of multiple correlation $R_{1.22'33'}$
for all the transformations of all the variables included in the regression
equation.

For the problem used as the example, where equation (15.4) was the
regression equation used, $R_{1.22'33'} = 0.968$. This is, accordingly, the
value of $I_{1.23}$ for that problem.

Where net regression curves are fitted by a graphic process, however,
the estimated values of $X_1$, $X_1'$, must be worked out for each observation,
the residuals $z$ calculated, and their standard deviation worked through.
The multiple correlation index is then calculated from these values by
the equation

$$I^2_{1.23\ldots k} = 1 - \frac{s_z^2}{s_1^2} \qquad (15.7)$$

For the steel-cost example, this becomes

$$I^2_{1.234} = 1 - \frac{(2.88)^2}{(7.19)^2} = 0.83956, \qquad I_{1.234} = 0.916$$


Indexes of multiple correlation can be interpreted in much the same
way as other correlation coefficients or indexes, as measuring what part
of the variation in $X_1$ can be explained by its relation to $X_2$, $X_3$, etc.
For the simplest interpretation the squares of the values should be used,
which will then be called the index of multiple determination, following
the usage of linear correlation. This measures the per cent of the variance
in $X_1$ which can be explained by the net curvilinear relations to $X_2$, $X_3$,
etc. This will be designated $d_{1.23\ldots k}$, and is defined as

$$d_{1.23\ldots k} = I^2_{1.23\ldots k} \qquad (15.8)$$

For the two problems we have used as illustrations, the first case, with
mathematically determined net regressions, had $I = 0.968$, and the
second, the steel-cost case with graphic regressions, had $I = 0.916$. The
corresponding $d$ values, 0.937 and 0.840, show that 94 per cent and 84
per cent, respectively, of the total variance in $X_1$ was explained in the
two cases by the curvilinear relations determined.
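
The figures quoted for the two illustrations can be checked with a few lines of Python, using the relations $S^2 = s_1^2(1 - R^2)$, $I^2 = 1 - s_z^2/s_1^2$, and $d = I^2$. Only numbers printed in the text are used; the two "implied factor" lines are an inference from the printed adjusted and unadjusted values, since n and m are not stated in this section.

```python
from math import sqrt

# Example 1: curves fitted algebraically, equation (15.4).
s1, R = 1.1, 0.968
S2 = s1**2 * (1 - R**2)              # unadjusted squared standard error
d_alg = R**2                         # index of multiple determination

# Example 2: steel costs, curves fitted graphically (dollars per ton).
s_z, s1_steel = 2.88, 7.19           # unadjusted residual s.d.; s.d. of X1
I2 = 1 - s_z**2 / s1_steel**2        # squared index of multiple correlation

print(round(S2, 4), round(sqrt(S2), 2))      # 0.0762 0.28
print(round(d_alg, 3))                       # 0.937
print(round(I2, 3), round(sqrt(I2), 3))      # 0.84 0.916

# Adjustment factors implied by the printed adjusted values (an inference).
print(round(0.1016 / 0.0762, 3))             # 1.333
print(round((3.86 / 2.88) ** 2, 3))          # 1.796
```

The recomputed $I^2$ differs from the printed 0.83956 only in the fifth decimal place, presumably because the printed standard deviations are rounded. The larger implied factor for the steel-cost case is in line with the remark that the adjustment for $n - m$ matters more when many constants are fitted.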
Measuring the Net Importance of Individual Factors, According
to Their Curvilinear Regressions. The importance of each of the
factors could be measured by an index of net curvilinear correlation
$i_{12.34\ldots k}$, which would be defined in a way parallel to that of the coefficient
of net or partial correlation $r_{12.34\ldots k}$. Where curves are fitted as algebraic
equations, such as equation (15.4), such a net correlation index might
be defined as

$$i_{12.3\ldots k} = r_{1(22').(33'\ldots kk')} \qquad (15.9)$$

It might be possible to work out a way of computing this value
algebraically, by a process similar to that used in footnotes 5 to 8 in Chapter 14.
This has not yet been done, however, and it seems likely it would involve
some arbitrary assumptions in combining the variances due to cross
products of $X_2$ and $X_3$.

A more direct method of calculating $i_{12.34}$ can be used by some additional
computation. This method is applicable to all net regression curves,
whether fitted algebraically or graphically. This is done by correlating
the curve readings for each independent variable for each observation
with the original value of the dependent variable $X_1$, so as to obtain new
linear partial regression coefficients between $X_1$ and these transformations
for each independent variable. The equation to be used may be written

$$X_1 = a' + b_{12.34}[f_2(X_2)] + b_{13.24}[f_3(X_3)] + b_{14.23}[f_4(X_4)] \qquad (15.10)$$

When that is done, and the partial correlation coefficients such as
$r_{12.34}$, $r_{13.24}$, etc., are computed, these values can be taken as indexes
of partial correlation, with

$$i_{12.34} = r_{12.34} \qquad (15.11)$$

This process may also change slightly the exact shape or slope of the
net regression curves, and raise slightly the size of the index of multiple
correlation. Standard errors may also be computed for the $b$ values of
equation (15.10), which will serve to judge the significance of each variable
according to the transformations provided by the functions $f_2$, etc.
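
The computation just described can be sketched as follows, assuming the
curve readings $f_2(X_2)$, $f_3(X_3)$, and $f_4(X_4)$ have already been
tabulated for each observation. The function names and the sample readings
are hypothetical, and the usual recursion formula for partial correlations
is swapped in here in place of the book's own computing routine.

```python
import math

def pearson(u, v):
    """Simple product-moment correlation of two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

def partial_first_order(r_ab, r_ac, r_bc):
    """Correlation of a and b with c held constant, from three correlations."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

def partial_index(x1, t2, t3, t4):
    """Partial correlation of x1 with readings t2, holding t3 and t4 constant."""
    r12_3 = partial_first_order(pearson(x1, t2), pearson(x1, t3), pearson(t2, t3))
    r14_3 = partial_first_order(pearson(x1, t4), pearson(x1, t3), pearson(t4, t3))
    r24_3 = partial_first_order(pearson(t2, t4), pearson(t2, t3), pearson(t4, t3))
    return partial_first_order(r12_3, r14_3, r24_3)

# Hypothetical curve readings for eight observations (not from the text).
t2 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
t3 = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
t4 = [2.0, 7.0, 1.0, 8.0, 2.0, 8.0, 1.0, 8.0]
x1 = [6.5, 9.6, 8.2, 13.4, 11.7, 23.3, 10.1, 21.8]
i_12_34 = partial_index(x1, t2, t3, t4)
```

Calling `partial_index` with the arguments permuted gives the corresponding
index for each of the other independent variables.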
Standard Errors and Confidence Intervals of the Sample Statistics.
The values of $S$ and $I$ from a sample drawn from a real universe will tend
to vary from the true values of the corresponding parameters in the
universe, and the shape of the net regression curves from the sample will
also tend to vary from the true functions in the universe. In cases where
the sample is based on a correlation model, methods are available for
making inferences as to the true values in the universe, based on confidence
intervals. These are discussed in Chapter 17.

Summary 

For curvilinear multiple regression equations it is possible to obtain 
standard errors of estimate, indexes of multiple correlation and deter- 
mination, and indexes of partial correlation, which serve the same purpose 
that the comparable coefficients serve for linear multiple regressions. 

REFERENCES 

Ezekiel, Mordecai, The application of the theory of error to multiple and curvilinear
correlation, Proc. Amer. Stat. Assoc., pp. 99-104, Vol. XXIV, March, 1929.

Ezekiel, Mordecai, A first approximation to the sampling reliability of multiple
correlation curves obtained by successive graphic approximations, Annals of
Math. Stat., Vol. I, September, 1930.

Mills, Frederick C., Statistical Methods, 2d edition, pp. 580-601. Henry Holt, New
York, 1955.

(Note: This reference treats the issues for simple curvilinear regression only.
The use of the analysis of variance shown here, to determine the significance of a
fitted simple curvilinear regression with attention to the degrees of freedom used
up, may also be extended to multiple curvilinear regression.)



CHAPTER 16 


Short-cut graphic method 
of determining net regression 
lines and curves 


In problems where the correlation is fairly high, the number of variables
is not too large, and the number of observations is not over 50 to 100
cases, net regression lines and curves may be determined by a combination
of inspection and graphic approximation which takes only a fraction of
the time required by the methods previously presented in detail. This
graphic method is very speedy and in the hands of a careful worker can
yield results almost as accurate as those obtained by the longer methods
previously set forth. The short-cut method was invented by Louis H. Bean.¹

The general basis of the short-cut method is to select, by inspection,
several individual observations for which the values of one or more
independent variables are constant, and then note the changes in the
dependent variable for given changes in the remaining independent
variable. This process is repeated for additional groups of observations
for which the other independent variable or variables are constant (or
practically so) but at a different level than for the first group. The relation
between the dependent variable and the remaining independent variable,
as indicated by a series of such groups, approaches the net regression line
or curve, since the cases have been selected so as largely to eliminate the
variation associated with other independent variables. A first approximation
line or curve is then drawn in by eye, and the residuals from this
curve, measured graphically, are used to determine the regression for the
next variable, cases again being selected so as to eliminate the influence
of other independent variables. The final fit of the several lines or curves
is tested by the same successive approximation process employed in
Chapters 10 and 14, or by a shorter graphic equivalent of it. Since the
initial lines or curves approach much more closely to the final net
regressions, and since graphic transfers of residuals are substituted for
reading all the curves and calculating the estimates and residuals
arithmetically, the process is much shorter, and fewer steps are required.
However, as mentioned earlier, the standard error of estimate converges
to its minimum value more rapidly than the net regression curves converge
to their final slope and shape.

¹ L. H. Bean, A simplified method of graphic curvilinear correlation, Journal of the
American Statistical Association, Vol. XXIV, pp. 386-397, December, 1929.

The Short-Cut Method Applied to Linear Net Regressions. The
short-cut method can be used to determine either linear net regressions
or curvilinear ones. The process of determining linear net regressions by
this method was illustrated on pages 269 to 276 of the second edition, but
is omitted here to make room for other more important materials. In
that illustration, with the approximations carried through two stages,
an adjusted standard error of estimate of 76.09 was obtained, as compared
with 74.65 for the least-squares solution, and an $R^2$ of 0.830, compared with
the exact value of 0.837. The net regression coefficients also differed
only slightly from their exact values. While successive approximations
would ordinarily not be used for determining linear net regressions,
their use does have the advantage that if curvilinear regressions are actually
present, that fact will become apparent during the process.

The Short-Cut Method Applied to Curvilinear Net Regressions. The
greatest usefulness of the short-cut method is in determining net curvilinear
regressions. Since the method of successive graphic approximations
presented in Chapter 14 also depends on the convergence of successive
approximate curves, the short-cut method secures results which are just
as reliable, at a great saving of time.

The procedure will be illustrated by a problem of four variables. The
same method may be applied to larger or smaller problems equally well,
up to the limit of the number of observations which can be kept separate
on the chart paper.

The data to be considered are given in Table 16.1. 

Data for 1938 and 1939 were also available when this study was made, 
but were disregarded until the analysis was completed, and then were 
used for checking the results. 

Logical Relation of the Variables. These data are from a study of
the relation of volume of steel output to cost per ton. The qualitative
examination of the problem (see discussion in publication cited in the
footnote to Table 16.1) indicated that changes in wage rates might be
expected to have a relative, or multiplying, effect upon the cost for a given
output, so that the relation might best be examined in terms of:

$$\log X_1 = f_2(X_2) + f_3(X_3)$$

Also, the qualitative examination revealed that major changes in technical
methods of production, especially the beginning of the substitution of
continuous strip mills for hand mills, had taken place during the period
under consideration, and that these improvements in technology might
need to be included, either directly as a labor-efficiency factor, or indirectly
as a trend factor.
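
To see why the logarithmic form expresses a multiplying effect, the relation
can be restated as below. This is only a brief restatement, assuming common
logarithms; the choice of base does not affect the argument.

$$\log X_1 = f_2(X_2) + f_3(X_3) \;\Longrightarrow\; X_1 = 10^{f_2(X_2)} \cdot 10^{f_3(X_3)}$$

Under this form a given change in wages multiplies the cost by the same
factor at every rate of output, instead of adding a fixed number of dollars
per ton.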

Table 16.1

Data for Short-Cut Method of Determining Regression Curves*

         Cost per Ton of     Proportion of         Average Hourly
Year,    Finished Steel,     Capacity Operated,    Earnings,
$X_4$    $X_1$               $X_2$                 $X_3$
         Dollars per ton     Per cent              Cents per hour

1920     72.3                88.3                  77.5
1921     78.5                47.5                  60.2
1922     57.9                71.3                  58.5
1923     63.0                88.3                  67.0
1924     63.7                69.0                  70.8
1925     62.9                78.4                  70.3
1926     60.3                88.0                  70.8
1927     59.6                78.9                  71.3
1928     55.2                83.4                  71.8
1929     51.5                89.2                  72.5
1930     58.6                65.6                  73.2
1931     65.6                38.0                  70.8
1932     81.4                18.3                  61.0
1933     65.0                28.7                  59.0
1934     64.6                31.2                  70.0
1935     65.4                38.8                  73.0
1936     61.1                59.3                  74.0
1937     65.6                71.2                  86.0

* The data are calculated from regular published reports of the U.S. Steel Corporation.
See Kathryn H. Wylie and Mordecai Ezekiel, The cost curve for steel production,
Journal of Political Economy, Vol. XLVIII, pp. 777-821, Dec., 1940.


To simplify this illustrative presentation, the data will be used in absolute
values, instead of in logarithms. The charts will be examined for indications
of multiplying relationship, however, since (as is shown in detail
on pp. 273-275) this graphic method can also be used to spot the presence
of such non-additive relations.



Conditions on the Curves to Be Drawn. Before proceeding to the
statistical steps in the examination of these data, the types of curves
logically expected and the resulting conditions to be placed upon the
shapes of the curves to be obtained must also be considered. Without
going into the underlying technical reasons (presented more fully in the
original study), let us assume that the following conditions will be imposed:

On the net relation of cost to capacity:

1. The curve may fall, at a declining rate, until a minimum is reached,
and may then increase gradually after that minimum is passed. No
points of inflection are expected.

On the net relation of cost to wages:

2. The curve will rise steadily, possibly at an increasing rate with
higher wages, but otherwise will be fairly uniform; that is, it will be either
a straight line or a shallow curve concave from above. There should be
no inflections.

On the net relation of cost to the time elements (efficiency, etc.):

3. The curve will tend to decline, perhaps slowly at first and then more
and more rapidly as new techniques are introduced. There might also
be irregular changes reflecting the changes in general price level (and in
various purchased materials and services other than labor) during the
period under examination, especially in the early 1920's and after 1929.
(This trend factor lumps together labor efficiency, price levels, and perhaps
other factors, each of which might be given separate consideration in a
more elaborate investigation.)
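
Conditions 1 and 2 can be checked numerically once a drawn curve has been
read off at equally spaced values of the independent variable. The sketch
below is one way of doing so; the function names and the readings are
hypothetical, and equal spacing of the readings is an assumption of the
sketch. Condition 3 allows irregular changes, so no check is sketched for it.

```python
def second_differences(readings):
    """Second differences of curve readings taken at equally spaced points."""
    return [readings[i + 1] - 2 * readings[i] + readings[i - 1]
            for i in range(1, len(readings) - 1)]

def is_concave_from_above(readings, tol=1e-9):
    """True if the curve never bends downward, so it has no inflection."""
    return all(d >= -tol for d in second_differences(readings))

def rises_steadily(readings):
    """True if every reading is higher than the one before it."""
    return all(b > a for a, b in zip(readings, readings[1:]))

# Hypothetical readings: a cost-capacity curve that falls at a declining
# rate to a minimum and then turns up slightly (condition 1).
capacity_curve = [80.0, 70.0, 64.0, 61.0, 60.0, 60.5]
# Hypothetical readings: a cost-wage curve rising at an increasing rate
# (condition 2).
wage_curve = [50.0, 55.0, 61.0, 68.0]

ok_capacity = is_concave_from_above(capacity_curve)
ok_wages = rises_steadily(wage_curve) and is_concave_from_above(wage_curve)
```

A curve with an S-shaped bend would fail the second-difference test, which
is the numerical counterpart of a point of inflection.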

Preliminary Examination of Inter-Relationships among the
Independent Variables. As before, the inter-relationships of the several
independent variables (including time for the trend factor) must be
examined before the short-cut approximations can be begun. These are
presented in Figure 16.1, the years being used to designate the observations.
After the dots were located, the successive years were connected by a
light line, making it possible to consider the relations of $X_4$ (time) to
$X_3$ and $X_2$, as well as of $X_2$ to $X_3$, all on this one chart. (This same method
could be used even in non-time-series data by first classifying the data
on the ascending values of one independent variable. Successive observations,
by number, would then indicate increasing values for that
variable.)

Examining first the location of the dots in Figure 16.1, without regard
to their sequence, a moderate intercorrelation between wages ($X_3$) and
rate of operations ($X_2$) is evident. No low values of $X_2$ are found, except
together with low values of $X_3$. In the higher ranges of $X_2$ the values of $X_3$
fan out more, varying from quite low to quite high. Apparently there is
enough independence in the occurrence of the two variables to permit of
fairly good separation of their effects.

Fig. 16.1. Wages and per cent of capacity operated, with successive observations
connected to indicate shift in the $X_2 X_3$ relationship with time.

When examined with regard to time, however, the intercorrelation
between $X_2$ and $X_3$ is considerably higher. The low wages at high output
all occurred in one period, 1921 to 1923. The marked positive correlation
of wages and operations from 1930 to 1937 is also a correlation with
time, both generally declining from 1930 to 1933, and both rising from
1933 to 1937. Since this was the period when technological changes were
greatest, it may be difficult to disentangle the time or trend elements here,
reflecting these technological changes, from the effects of the associated
advances in output and in wages. We shall have to be on guard for this
as we proceed with the analysis.
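
The intercorrelation read from the chart can also be checked arithmetically.
The sketch below computes the simple correlation between per cent of capacity
operated and hourly earnings from the data of Table 16.1, first for all
years and then for 1930 to 1937 alone. The function name is a hypothetical
helper, and the computed coefficients are an illustration, not figures
quoted in the text.

```python
import math

YEARS = list(range(1920, 1938))
# Per cent of capacity operated (X2) and hourly earnings (X3), Table 16.1.
X2 = [88.3, 47.5, 71.3, 88.3, 69.0, 78.4, 88.0, 78.9, 83.4,
      89.2, 65.6, 38.0, 18.3, 28.7, 31.2, 38.8, 59.3, 71.2]
X3 = [77.5, 60.2, 58.5, 67.0, 70.8, 70.3, 70.8, 71.3, 71.8,
      72.5, 73.2, 70.8, 61.0, 59.0, 70.0, 73.0, 74.0, 86.0]

def pearson(u, v):
    """Simple product-moment correlation of two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

# Correlation over the whole period.
r_all = pearson(X2, X3)

# Correlation for 1930 to 1937 only, the period singled out in the text.
late = [i for i, year in enumerate(YEARS) if year >= 1930]
r_late = pearson([X2[i] for i in late], [X3[i] for i in late])
```

A much higher coefficient within the later period than over the whole series
is the arithmetic counterpart of the warning about disentangling trend from
output and wages.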

Looking for groups of observations which hold the other factor constant,
we note on Figure 16.1 that there were a considerable number of years
when wages² fell between 70 and 75 cents per hour. The observations
for these years may be used to hold wages substantially constant, while
the data are examined for the apparent effects of operation rate and time.

² "Wage rates per hour" is quite a different thing from "average earnings per hour
employed," since the latter is a weighted figure reflecting all changes in the composition
of the labor force. The latter is the figure used here (note Table 16.1) since an average
wage rate figure was not available. For brevity, however, the term "wages" will be
used here to describe the data, even though that is not the technically correct designation.
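
The selection of a group of years with wages held nearly constant can be
sketched as follows, using the data of Table 16.1. The function name
`years_in_band` is a hypothetical helper, and treating both limits of the
band as inclusive is an assumption of this sketch.

```python
# Table 16.1: year -> (cost per ton X1, per cent of capacity X2, earnings X3)
STEEL = {
    1920: (72.3, 88.3, 77.5), 1921: (78.5, 47.5, 60.2), 1922: (57.9, 71.3, 58.5),
    1923: (63.0, 88.3, 67.0), 1924: (63.7, 69.0, 70.8), 1925: (62.9, 78.4, 70.3),
    1926: (60.3, 88.0, 70.8), 1927: (59.6, 78.9, 71.3), 1928: (55.2, 83.4, 71.8),
    1929: (51.5, 89.2, 72.5), 1930: (58.6, 65.6, 73.2), 1931: (65.6, 38.0, 70.8),
    1932: (81.4, 18.3, 61.0), 1933: (65.0, 28.7, 59.0), 1934: (64.6, 31.2, 70.0),
    1935: (65.4, 38.8, 73.0), 1936: (61.1, 59.3, 74.0), 1937: (65.6, 71.2, 86.0),
}

def years_in_band(data, low, high):
    """Years whose hourly earnings (X3) lie between low and high, inclusive."""
    return sorted(year for year, (_, _, wages) in data.items()
                  if low <= wages <= high)

# Years with wages of 70 to 75 cents, used to hold wages about constant.
constant_wage_years = years_in_band(STEEL, 70.0, 75.0)

# (X2, X1) pairs for those years, ready to plot cost against operation rate.
points = [(STEEL[y][1], STEEL[y][0]) for y in constant_wage_years]
```

Eleven of the eighteen years fall in this band, which is why it serves as
the starting group for the first chart.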

Determination of First Approximation Curve for First Independent
Variable. The observations for the years with wages of 70
to 75 cents are accordingly plotted on Figure 16.2 with per cent of capacity
operated ($X_2$) as the abscissa and cost per ton ($X_1$) as the ordinate.³

Fig. 16.2. Cost per ton and per cent of capacity operated, and first
approximation to $f_2(X_2)$.

After the dots are plotted, successive observations (when they occur in
this group) are connected by a “drift line” of short dashes. This enables us
to examine the relation of cost to operation rate and time while holding
wages constant.

These observations indicate at once a marked negative correlation
between operation rate and cost. The data from 1924 to 1929 suggest
a rapid fall in cost for a given rate, especially from 1927 to 1929.
Apparently there was some further decline from 1931 to 1934, but the data
for 1935 to 1936 fall almost precisely on the same line as those for 1930
to 1931. (However, examination of Figure 16.1 shows that wages were
slightly higher in this latter period, which might obscure the trend factor
at this point.) No curve is indicated as yet. Accordingly, a line is drawn
in lightly, as indicated, to show the relation of cost to operation rate for
these observations, with the trend factor also considered.⁴

³ Great care should be exercised in plotting these values, as their exact location
becomes the basis for all the successive graphic transfers. Chart paper of adequate
size to separate the dots should be used.

The observations for years of very low wage rates (1921, 1922, 1932,
and 1933) are next plotted and consecutive years again connected by a
line with long dashes for the first two, and short dashes for the latter.
Both show exaggerated drops in costs with increases in output. Only
1933 shows a cost lower than might be expected from the observations
previously plotted. If 1932 were also to show a cost below the usual
relation, the regression curve would have to swing up sharply, so as to pass
above it. The high value for 1921 may be ignored for the moment, as
possibly reflecting the high price levels at the end of the inflation period
after World War I.

The two years of high wages (1920 and 1937) and the one remaining
year of moderately low wages, 1923, are next plotted. The dot for 1937
falls above the other observations, and that for 1920 much higher still,
apparently confirming the unusual (trend?) factors affecting the position
of the 1921 observation. Similarly 1923 is fairly high, despite its moderate
wage rate, as compared to subsequent years.

The indications as to the effects of wage rates, to this point, sum up
as follows: 1920 to 1923 all show relatively high costs (with the exception
of 1922). Apparently trend elements outweighed the effects (if any) of
the low wages in 1921 and 1923. With low wage rates, 1933 shows quite
a low cost for the low rate of output, whereas 1932, with somewhat higher
wage rates, shows a much higher cost. Apparently the fall in output to
near zero increases cost per unit very greatly. On the basis of these
considerations, a curve could be drawn in as the first approximation,
extending the previous line but bending it up to pass well above 1932, with
its low wage rate. With only one or two observations to support that bend
at this stage, it seems best to be more conservative until the other factors
have been more definitely allowed for, and until the evidence for a curve
(if any) is more clearly established (even though a curve of declining costs
was expected).

Accordingly the straight line previously drawn in lightly is extended
and used as the first approximation toward the net regression, $f_2'(X_2)$.
(If a curve had been clearly indicated by the examination of the data as
described above, it would have been drawn in at this point, thus starting
the successive approximations from a curve instead of from a straight line.)

⁴ By drawing this line parallel to the lines connecting successive years, all trend is
eliminated except the one-year change. If the line were tilted slightly more steeply than
the line connecting successive years, that would provide an approximate correction for
the year-to-year change, also. With the uncertainty of trend effects after 1931, however,
that was not done here, but was left for subsequent approximations to clarify.

Determination of First Approximation Curve for Second Independent
Variable. The next step is to examine the relation of costs,
as now approximately corrected for the relation to operation rate, to
wages and time. Accordingly, the vertical departures of the
dots on Figure 16.2 from the line of $f_2'(X_2)$ are scaled off, and are plotted
in Figure 16.3.⁵ The departures are plotted as ordinates above and below
the zero line, with the values of wages, $X_3$, as abscissas. If the fourth
variable, $X_4$, were not a time series, or not arranged in order, it would be
necessary to group these observations according to its value, also, as was
done in plotting Figure 16.2. Since the numbers of the successive years
indicate the successive values of $X_4$, that is not necessary. After the dots
are all plotted, the successive years are connected by a light dotted line,
to aid in separating the trend influences from that of wages.

Fig. 16.3. Wages and cost per ton adjusted to average operation rate on the
basis of the first approximation, and first approximation to $f_3(X_3)$.
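
Scaling off the vertical departures from a straight first approximation is
the graphic equivalent of the small computation sketched below. The three
observations are taken from Table 16.1, but the intercept and slope of the
line are hypothetical values chosen for illustration; the text draws the
line by eye and gives no equation for it.

```python
def vertical_departures(points, intercept, slope):
    """Departure of each observation from the line y = intercept + slope * x.

    points is a list of (year, x, y) tuples; the result maps year to departure.
    """
    return {year: y - (intercept + slope * x) for year, x, y in points}

# (year, per cent of capacity X2, cost per ton X1) from Table 16.1.
observations = [(1929, 89.2, 51.5), (1932, 18.3, 81.4), (1936, 59.3, 61.1)]

# Hypothetical first approximation line (assumed, not read from Figure 16.2).
departures = vertical_departures(observations, intercept=80.0, slope=-0.25)
```

Each departure would then be plotted against the wage figure for the same
year, as described above for Figure 16.3.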

If the dotted line to the successive years is followed, it is apparent
that there was a general downward trend in the adjusted costs. The years
1920 and 1921 appear on one level, the years 1922 to 1927 on a lower
level, and the years from 1928 on (with the exception of 1932) on a still
lower level. In each of these groups of years there is a positive relation
between adjusted costs and wages, as indicated by the light lines drawn
through each group. Only the last group has any indication of a curve.
Even there, the curve depends entirely on the position of the two extreme
observations, one at each end. Here, however, the lower portion of this
curve parallels, almost exactly, the lines indicating the apparent positions
for the two other groups, which in turn lie mainly on the left half of the
lower group of observations. Furthermore, the shape of the curve,
shallowly concave, is consistent with that logically expected. Accordingly,
a shallow curve passing through the center of the observations is drawn
in, approximately paralleling the drift lines and curve representing the
relations for the three groups. The succeeding successive approximations
will show whether this curve is justified or whether a straight line should
be substituted.

⁵ The job of making these readings and transfers can be made swifter and more
accurate by using the technique outlined on pages 526 to 530.

Determination of First Approximation Curve for Third Independent
Variable. The next step is to examine the relation of costs,
now approximately adjusted for both wages and operation rate, to time.
Accordingly, the vertical departures of the dots on Figure 16.3 from the
curve $f_3'(X_3)$ are scaled off, and are plotted in Figure 16.4. Again the
departures are plotted as ordinates, with this time the values of $X_4$ as
abscissas. Since this is the last independent variable to be considered,
it is not necessary to group the observations with respect to any other
variable, but all can be plotted and examined as a whole. Figure 16.4
shows the resulting chart. Connecting the successive years makes it
easier to study the type of trend present.⁶

Fig. 16.4. Time and cost per ton adjusted to average operation rate and wages, on
the basis of the first approximation curves, and first approximation to $f_4(X_4)$.

Except for the single wide departure in 1932, Figure 16.4 indicates a
definite downward trend from the beginning, tapering off about 1930 and
running flat or gradually rising thereafter. Taking midpoints between each
pair of observations (indicated by the crosses) helps to locate the
approximate level of this trend. The one extreme departure, 1932, is disregarded
in the process. Its position in Figure 16.2, at the extreme end of the line,
meant that its adjustment for $X_2$ was in doubt. A smooth curve is then
drawn in, declining to about 1930, and running flat thereafter. The rising
trend indicated by the observations for 1936 and 1937 is left for subsequent
approximations to confirm. In general it is unwise to give an extra
“twist” to a regression curve simply on the evidence of one or two
observations.
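
The midpoints used to locate the level of the trend can be sketched
numerically as below. The function name and the departures are hypothetical
illustrations, and dropping a disregarded year before pairing the remaining
observations is one reading of the procedure, assumed here.

```python
def successive_midpoints(years, departures, skip=()):
    """Midpoints between each successive pair of observations.

    Years listed in skip are dropped before pairing, as is done in the
    text for a single extreme departure.
    """
    kept = [(y, d) for y, d in zip(years, departures) if y not in skip]
    return [((y0 + y1) / 2, (d0 + d1) / 2)
            for (y0, d0), (y1, d1) in zip(kept, kept[1:])]

# Hypothetical adjusted-cost departures for four successive years.
years = [1920, 1921, 1922, 1923]
departures = [6.0, 4.0, -1.0, 1.0]
crosses = successive_midpoints(years, departures)
```

Averaging adjacent observations in this way damps the year-to-year zigzag,
so the crosses trace the level of the trend more smoothly than the dots.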

Determination of Second Approximation Curve for First Independent
Variable. We now have determined first approximations
to the net regression lines or curves of $X_1$ on $X_2$, $X_3$, and $X_4$. The
departures of the dots on Figure 16.4 from the regression line $f_4'(X_4)$ are the
residuals, $z''$, from this first set of curves. The remaining steps involve the
graphic transfer of these residuals to each curve in turn, the correction
of each curve on the basis of the fit of the new residuals, and in turn the
transfer of the newly corrected residuals to the next curve, and so on
until no further change is indicated in any of the curves. Ordinarily the
residuals from Figure 16.4 would be plotted back on the original curve
for $X_2$, Figure 16.2. To show the process clearly, however, the dots and
the first approximation curve for $X_2$ from Figure 16.2 are reproduced
again as Figure 16.5.
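
The cycle of transferring residuals from one curve to the next can be
sketched numerically as follows. In this sketch class averages stand in for
the freehand curves; that substitution is an assumption made so the example
can run, not the graphic procedure of the text, and the function names and
data are hypothetical.

```python
def class_means(classes, values):
    """Average of values within each class of an independent variable."""
    sums, counts = {}, {}
    for c, v in zip(classes, values):
        sums[c] = sums.get(c, 0.0) + v
        counts[c] = counts.get(c, 0) + 1
    return {c: sums[c] / counts[c] for c in sums}

def successive_approximations(y, groups, rounds=10):
    """Correct each curve in turn from the dependent variable adjusted for
    all the other curves; return the curves and the final residuals.

    groups holds one list per independent variable, giving the class into
    which each observation falls for that variable.
    """
    n = len(y)
    mean_y = sum(y) / n
    curves = [dict() for _ in groups]   # class -> departure from the mean
    for _ in range(rounds):
        for j, classes in enumerate(groups):
            adjusted = [
                y[i] - mean_y - sum(curves[k].get(groups[k][i], 0.0)
                                    for k in range(len(groups)) if k != j)
                for i in range(n)
            ]
            curves[j] = class_means(classes, adjusted)
    residuals = [
        y[i] - mean_y - sum(curves[k][groups[k][i]] for k in range(len(groups)))
        for i in range(n)
    ]
    return curves, residuals

# Hypothetical example: two independent variables, each in three classes.
classes_a = [0, 0, 0, 1, 1, 1, 2, 2, 2]
classes_b = [0, 1, 2, 0, 1, 2, 0, 1, 2]
y = [61.0, 63.0, 66.0, 63.0, 65.0, 68.0, 68.0, 70.0, 73.0]
curves, residuals = successive_approximations(y, [classes_a, classes_b])
```

With this hypothetical, exactly additive data the residuals vanish after the
first round. With real data the rounds would be repeated until the curves
stop changing, which is the stopping rule stated above.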

The vertical departures of the dots on Figure 16.4 from the approximation
curve, $f_4'(X_4)$, are then plotted on Figure 16.5 as departures above
and below the regression line, $f_2'(X_2)$, with the corresponding values of
$X_2$ as abscissas. To prevent confusion with the original values shown as
solid dots, the corrected values are indicated as hollow dots.

It is at once apparent, on inspection of Figure 16.5, after the corrected
values are all plotted in, that the new values show much less scatter than
the original values. Closer inspection reveals that every one of the adjusted
observations below 60 per cent of capacity falls above the first approximation
line, with a single exception. In the range from 60 per cent to
80 per cent, 3 cases fall below the first approximation line (2 widely)
and 3 slightly above, indicating in this range that the new line should be
lower than before. The 5 observations above 80 per cent fall 2 below,
2 about the same distance above, and 1 right on the line, indicating that
the position of the line here is about correct. These departures confirm
the suggestion previously given by the 1932 value in Figure 16.2 that the
regression should be a curve, concave from above. This accords, also,
with the logical conditions originally imposed on this relation. Accordingly
such a curve is drawn in freehand, passing as near as possible through
the averages of the adjusted values in each successive group. (To facilitate
drawing the curve, the averages of the residuals in successive ranges of
10 to 15 units of $X_2$ are estimated graphically and drawn in as hollow
squares.)

Fig. 16.5. Per cent of capacity operated, and cost per ton unadjusted and adjusted
to average values of other variables, and second and third approximations to $f_2(X_2)$.

⁶ If joint functions are suspected (see Chapter 21) the data might again be grouped
for values of $X_2$ and $X_3$, in plotting Figure 16.4. If these groups showed varying
relations to $X_4$, even after the approximate relations to $X_2$ and $X_3$ had now been
eliminated, that would indicate the presence of a joint relation. Note Figure 16.8, and
the discussion on pages 273 to 275 of this chapter.
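
The group averages through which the corrected curve is drawn can also be
computed instead of estimated graphically. The sketch below averages the
residuals within successive ranges of the independent variable. The function
name, the range limits, and the residuals are hypothetical illustrations;
the text itself uses ranges of 10 to 15 units.

```python
def range_averages(xs, residuals, edges):
    """Average x and average residual within each range [low, high).

    edges lists the range limits in ascending order; empty ranges are
    skipped. Returns one (mean x, mean residual) pair per occupied range.
    """
    averages = []
    for low, high in zip(edges, edges[1:]):
        members = [(x, z) for x, z in zip(xs, residuals) if low <= x < high]
        if members:
            mean_x = sum(x for x, _ in members) / len(members)
            mean_z = sum(z for _, z in members) / len(members)
            averages.append((mean_x, mean_z))
    return averages

# Hypothetical operation rates and residuals from a first approximation line.
rates = [20.0, 30.0, 46.0, 65.0, 70.0, 85.0, 90.0]
resid = [3.0, 2.0, 1.0, -1.0, -2.0, 0.5, -0.5]
squares = range_averages(rates, resid, edges=[0.0, 60.0, 80.0, 100.0])
```

Positive averages at low rates and negative ones in the middle range, as in
this made-up case, are the pattern that calls for a curve concave from above.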

Determination of Second Approximation Curve for Second Independent
Variable. The vertical departures of the adjusted values (the
hollow dots) above or below the second approximation curve, $f_2''(X_2)$, are
next scaled off graphically and plotted as ordinates from the values of the
$f_3'(X_3)$ as zero, with the corresponding $X_3$ values as abscissas. This is
generally done on the original chart (Figure 16.3). For clarity,
however, the curve of Figure 16.3 is here reproduced on Figure 16.6, and
the departures from Figure 16.5 are transferred to this new chart. The 4
observations around 60 for $X_3$ average definitely below the line; both
the next group up to 72.5 and the next group 72.5 up to 75 average
slightly below, whereas the single observation above 85 falls above the
line. These averages are indicated by squares on Figure 16.6.⁷ The
single high observation at the end alone would not be enough to indicate a
change in the curve, but it is consistent with the group averages, which
indicate the need for a slightly steeper curve than the original one.
Accordingly this new curve is drawn in, approximately through the group
averages, but still conforming to the conditions stated on p. 257. To this
point none of the relations, as indicated by the data, has differed
sufficiently from the shapes logically expected to require any reconsideration
of the logical analysis from which the conditions limiting the shapes to
be drawn were derived.

Fig. 16.6. Wages, and cost per ton adjusted to average values of all other
variables, and second and third approximations to $f_3(X_3)$.

⁷ These averages have been estimated graphically, by the technique explained on
page 530.

Determination of Second Approximation Curve for Third Independent
Variable. The same process is used in determining the second
approximation for the next variable. The vertical departures of the dots
on Figure 16.6 above or below the second approximation curve, $f_3''(X_3)$,
shown as a dashed line, are scaled off and plotted as departures from the
$f_4'(X_4)$ curve, with the corresponding $X_4$ values as abscissas. Again a new
chart is prepared, Figure 16.7, with $f_4'(X_4)$ reproduced, although the
original chart, Figure 16.4, is still clear enough so that these new values
could readily have been plotted upon it. Again, as the observations are
equally spaced in time, a continuous light line is drawn in, connecting the
successive observations.





Fig. 16.7. Time, and cost per ton adjusted to average operation rate and wages on
basis of second approximation curves, and second approximation to $f_4(X_4)$.


If the curve were any ordinary function (anything except a trend
allowance for a number of unrepresented factors) there would be little
evidence, from the dots in Figure 16.7, for any further change in the
fitted curve. Since it is a trend allowance, however, and was expected
to be irregular on logical grounds (note the conditions stated on page 257),
more flexibility may be in order. Comparing Figure 16.7 with Figure 16.4
we see that the observations have been changed only slightly by the
further adjustments for $f_2(X_2)$ and $f_3(X_3)$. The individual observations
on both charts show a pronounced fall from 1920 to 1924, a flattening
out then for three or four years, then another fall to 1929. Between 1923
and 1927, Figure 16.7 shows that 4 out of 5 observations fall above the
line, whereas, between 1928 and 1935, 6 out of the 8 observations
fall below the line. These departures indicate that some changes in the
first curve are justified. It is apparent that these changes would not be
inconsistent with the possible composite effects of price-level changes and
a general upward trend in production efficiency, with a corresponding
downward trend in production costs relative to wage rates and other
factors. The sharp fall from 1920 to 1924, however, largely reflects the
two high observations for 1920 and 1921, offset somewhat by a very low
observation in 1922. Accordingly, the trend may be interpreted as
moderately downward from 1920 to 1926, more sharply downward to
about 1929, then gradually tapering off to a low about 1933 or 1934, and
rising gradually thereafter. A more flexible trend is therefore drawn in
according to these general changes but not following single observations
to the extremes of their departures.⁸

Determination' of Third Appro.xtmation Curves. The s.ime proce^j 
as before is now repeated, plotting the departures from around the 
fliX^ cuH'e, with X^ values as abscissas. This lime the new departures 
shown on Figure 16.7 are plotted back on the previous cliart. Ffgure 16.5. 
Crosses are used for the new departures, to distinguish them from the 
previous values shown as hollow dots. To prevent confusing the chart, 
the observation (year) number is not shown with the cross, e.vcept where 
there are two or more observations with about the same .\4 value. 

Examining the location of these new crosses on Figure 16.5, we notice
that, for every observation with a value below 50 for $X_2$, the cross is one
to one and one-half units (of $X_1$) higher than the corresponding dot. For
values of $X_2$ above 50, however, the crosses fall alternately above and
below the corresponding dots, with the averages of the crosses hitting
just about the curve. This pattern indicates that the $f_2(X_2)$ curve should
be raised somewhat below 50, to be still steeper. Accordingly, a new
curve is drawn in, changed as indicated, to pass as near as possible through
the group averages of the crosses (as graphically estimated) and yet
conform with the logical limitations on its shape.

* Only in rare instances would a curve with this much flexibility be justified. In this
particular case its use is in line both with the theoretical analysis and the resulting
conditions imposed on the shape of the curve.



268 Multiple Curvilinear Regressions

The vertical departures of the crosses from the new curve, $f_2(X_2)$, are
then carried forward to Figure 16.6, as departures from $f_3(X_3)$. Again
crosses are used to represent the new values.

Inspection of Figure 16.6, after the crosses are inserted, discloses a
different situation from that in the previous chart. In the left portion of
Figure 16.6, for values of $X_3$ below 65, the crosses fall very close to the
corresponding dots, with no change for the average. In the right-hand
portion, for values of $X_3$ above 75, the crosses also fall above and below
the corresponding dot. Between 65 and 75, however, a number of the
crosses fall a considerable distance below the corresponding dot, so that
out of the twelve observations in this range, six crosses fall slightly above
the $f'$ line and six fall a considerable distance below. This pattern indicates
that the $f'$ curve should be made more sharply concave, without changing
the elevation of either end. A new curve is therefore drawn in to correct
this, through the group averages of the crosses. (To prevent confusion,
these averages are not shown on Figure 16.6.) The sharp lift in the last
portion of this curve is dependent only upon the two observations, 1920
and 1937. However, the shape of this part of the curve is consistent with
the logical limitations and with the other observations. Except for these
two observations, a straight line would fit the crosses almost as well as
the curve. The evidence for the existence of a curve, or for its exact
shape, is thus very uncertain, as the data are distributed here.*

If the $f''$ curves are compared with the $f'$ curves on both Figure 16.5
and Figure 16.6, it is evident that we have determined the shape of these
curves about as well as we can with the data at hand. Even with the
material change in the trend by using the much more flexible curve of
$f_4(X_4)$, the differences between the $f'$ curves and the $f''$ curves for $X_2$ and
$X_3$ are insignificant. However, to complete the process we carry the
final residuals, the departures of the crosses on Figure 16.6 from the
$f_3(X_3)$ curve, over to Figure 16.7, as departures from the trend line.

There is little improvement in the average closeness of the crosses to
the trend line, $f_4(X_4)$, as a result of the slight changes in $f_2$ and $f_3$. The
general characteristics of the trend, as fitted by the previous flexible curve,
remain the same. From 1923 to 1930, every cross falls slightly above the
corresponding dot, suggesting the possibility of a slightly better fit if the
trend were raised a little in this portion. The single high value in 1932
continues to stand out, alone and unexplained. It seems hard to justify it
on any trend basis. We could eliminate the wide departure for 1932 by

* See page 291 of Chapter 17 for the sampling reliability of the portion of a curve
determined by such extreme observations, where the theory of random sampling may
be properly applied.





By the Short-cut Graphic Method 

twisting the lower end of $f_2(X_2)$ up sharply to pass through this single
observation. In the absence of confirmatory evidence from another such
low year for percentage of capacity operated, this would be a [word illegible]
assumption.

Although it would be possible to modify the trend further, as suggested
in the preceding paragraph, it seems best to let it stand unchanged. In
view of the slight changes in the $f_2$ and $f_3$ curves in the last approximation,
we end the successive approximation process at this point, feeling we have
carried the process about to the point of diminishing returns in increased
accuracy.

It should be noted, in Figures 16.5, 16.6, and 16.7, that the final curves
at the end of the approximation process differ significantly from the
first approximations only in the case of $f_4(X_4)$. Almost the same flexible
trend of $f_4(X_4)$ could have been drawn in the first approximation on
Figure 16.4. The closeness with which the first approximations to
$f_2(X_2)$, $f_3(X_3)$, and $f_4(X_4)$ approximate the final curves is an indication
of the great power of the graphic
method in making a rapid approach to the underlying relations. The
routine of comparing selected observations for which the values of the
other independent variables are constant, or almost so, and judging the
net relations from these selected comparisons provides a much closer
initial approximation to the final curves than does the initial assumption
of linear net regressions, used as the starting point in the successive
approximation process presented in Chapter 14.

(For an exercise, the student might take the example which has just
been analyzed and determine the net regression curves by the method of
Chapter 14, using the same limitations on the shape of the curves as used
here. That will enable him to compare the relative speed and effectiveness
of the two methods in approaching the final curves.)

As already noted, the intercorrelations among $X_2$, $X_3$, and $X_4$ were
only moderate in this case. In a problem where the intercorrelations
among the independent variables were quite high, the changes in the
several regression curves as a result of the successive approximation
process might be more marked than in the example just completed. In
such a case the convergence toward the curves of best fit will be slower
than where the intercorrelations are low, and a larger number of successive
approximations will be required to determine the final curves.
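
The effect of intercorrelation on the speed of convergence can be
illustrated numerically for the linear case. The sketch below is not the
graphic procedure itself: it assumes two standardized independent
variables with correlation `r`, replaces each graphic refit by an
arithmetic partial regression, and counts the rounds needed before the
slopes stop changing. The function name, the starting guesses of zero,
the tolerance, and the "true" slopes of 1.0 and 0.5 are all illustrative
assumptions.

```python
def successive_approximation(r, true_b=(1.0, 0.5), tol=1e-6, max_rounds=10_000):
    """Alternately refit each slope to what the other slope leaves unexplained.

    The variables are taken as standardized, so each sum of squares is 1 and
    the cross-product between the two independent variables equals r.
    Returns (slope_a, slope_b, rounds_used).
    """
    # Cross-products with the dependent variable implied by the assumed slopes.
    s_ay = true_b[0] + r * true_b[1]
    s_by = r * true_b[0] + true_b[1]

    b_a, b_b = 0.0, 0.0  # arbitrary first approximations
    for rounds in range(1, max_rounds + 1):
        new_a = s_ay - r * b_b      # refit the first slope, holding the second
        new_b = s_by - r * new_a    # refit the second slope, holding the first
        change = max(abs(new_a - b_a), abs(new_b - b_b))
        b_a, b_b = new_a, new_b
        if change < tol:
            return b_a, b_b, rounds
    return b_a, b_b, max_rounds


low = successive_approximation(r=0.30)
high = successive_approximation(r=0.95)
print("r = 0.30:", low)
print("r = 0.95:", high)
```

Under these assumptions both runs settle on the same slopes, but the
highly intercorrelated case needs many more rounds, which is the pattern
the text describes.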

If, after several approximations have been made, the new curves start
swinging up and down over curves previously determined, the approximation
has probably been carried far enough. Especially where the
intercorrelations for two independent variables are very high, a rise in
the slope of one curve will cause a fall in the slope of the other. In such a
case the exact position of each of the two curves is indeterminate, and the
zone within which the last two or three approximations vary will indicate
something of the uncertainty as to the exact shape or location of each
curve. As will be shown later (Chapter 17), the reliability of any net
regression line or curve varies inversely with the extent to which the
particular independent variable is correlated with the other independent
variables. Where two variables are so closely correlated that the relation
to the dependent variable may be ascribed to either independent variable
or parceled out more or less arbitrarily between the two, their individual
effect is indeterminate. Only by securing a large enough sample can the
true influence of each be judged. When used with due regard to the
logical significance of the curves obtained, any one of the several methods
will tend to give results which are substantially the same; that is, which
lie within the range of possible accuracy imposed by the facts of the
particular sample.

Determining Standard Error of Estimate and the Index of
Multiple Correlation. The standard error of estimate may now be
determined by first computing the values of $z''$. This can be done most
simply by scaling off, on Figure 16.7, the departures of the last adjusted
values (the crosses) from the final trend curve. These departures are the
$z''$'s. Any errors which have been made in any of the successive graphic
transfers will accumulate in these residuals. A more exact check can
be made by reading off the estimated values for each observation from
the final curves and adding them up to calculate the estimated $X_1$ and $z''$,
according to the same method used in Chapter 14. The $z''$ values as
computed in this manner should agree closely with the $z''$'s scaled from
the final approximation chart. These calculations are shown in Table 16.2.
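
The arithmetic check described above can be sketched in a few lines.
The example below takes three rows of Table 16.2 (1920, 1923, and 1932),
adds the three curve readings to form the estimate, subtracts it from the
actual value, and compares the result with the residual scaled from the
chart. The function name and the choice of rows are illustrative
assumptions; the figures are those of the table.

```python
# (year, f2(X2), f3(X3), f4(X4), actual X1, residual scaled from the chart)
rows = [
    (1920, 57.1, 4.9, 9.7, 72.3, 0.9),
    (1923, 57.1, -0.3, 4.9, 63.0, 1.5),
    (1932, 84.6, -1.7, -7.3, 81.4, 5.9),
]


def check_row(f2, f3, f4, actual):
    """Return (estimated X1, computed residual), rounded to one decimal."""
    estimate = round(f2 + f3 + f4, 1)
    residual = round(actual - estimate, 1)
    return estimate, residual


for year, f2, f3, f4, actual, scaled in rows:
    estimate, residual = check_row(f2, f3, f4, actual)
    gap = round(abs(residual - scaled), 1)  # disagreement between the two methods
    print(year, estimate, residual, scaled, gap)
```

For these three rows the computed and scaled residuals differ by 0.3,
0.2, and 0.1, within the largest difference of 0.4 reported for the table
as a whole.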

Column 10 of Table 16.2 gives the residuals as scaled off from the last
approximation curve on Figure 16.7. Column 9 gives the residuals as
computed in the usual way from the several curve readings. It is evident
that the two columns agree very closely, the largest difference being only
0.4. This is an indication of the degree of accuracy maintained in the
successive graphic transfers. In this case graph paper 8 by 10 inches was
used in preparing the charts for Figures 16.2 to 16.7, and each of the
transfers was double-checked. If higher accuracy in the mechanical
process is desired, a still larger scale could be employed.

Taking the residuals in column 9 as the most accurate, we may now
calculate their standard deviation (around their own mean). It works out
at 2.88. This compares with a standard deviation for $X_1$ of 7.19.

Table 16.2

Calculation of Estimated X1 from Final Regression Curves

Year    X2     X3    f2(X2)  f3(X3)  f4(X4)  Est. X1    X1      z''     z''*
                                                              (8 - 7)
(1)     (2)    (3)    (4)     (5)     (6)      (7)      (8)     (9)     (10)

1920   88.3   77.5   57.1     4.9     9.7     71.7     72.3     0.6     0.9
1921   47.5   60.2   67.8    -1.8     8.1     74.1     73.5    -0.6     [?]
1922   71.3   58.5   60.5    -2.1     6.5     64.9     57.9    -7.0     [?]
1923   88.3   67.0   57.1    -0.3     4.9     61.7     63.0     1.3     1.5
1924   69.0   70.8   61.0     1.0     3.4     65.4     63.7    -1.7    -1.8
1925   78.4   70.3   59.1     0.8     1.9     61.8     62.9     1.1     0.8
1926   88.0   70.8   57.2     1.0     0.3     59.5†    62.3†    1.8     2.1
1927   78.9   71.3   59.0     1.2    -1.6     58.6     59.6     1.0     1.3
1928   83.4   71.8   58.1     1.4    -3.7     55.8     55.2    -0.6    -0.5
1929   89.2   72.5   57.0     1.8    -5.4     53.4     51.5    -1.9    -1.7
1930   65.6   73.2   61.9     2.2    -6.3     57.8     58.6     0.8     1.0
1931   38.0   70.8   72.2     1.0    -6.9     66.3     65.6    -0.7    -0.7
1932   18.3   61.0   84.6    -1.7    -7.3     75.6     81.4     5.8     5.9
1933   28.7   59.0   77.3    -2.0    -7.5     67.8     65.0    -2.8     [?]
1934   31.2   70.0   75.8     0.7    -7.4     69.1     64.6    -4.5    -4.1
1935   38.8   73.0   71.7     2.0    -7.0     66.7     65.4    -1.3    -1.1
1936   59.3   74.0   63.5     2.6    -6.4     59.7     61.1     1.4     1.3
1937   71.2   86.0   60.5    11.0    -5.4     66.1     65.6    -0.5    -0.8

* These are the values of z'' scaled off from Figure 16.7.
† As printed; 57.2 + 1.0 + 0.3 = 58.5.
[?] Entry illegible in the source.

Before computing $\bar{S}_{1.234}$ we need the values for $n$ and $m$. A simple
parabola or hyperbola with two constants would probably represent each of
$f_2(X_2)$ and $f_3(X_3)$. However, $f_4(X_4)$, with its two inflections, would
probably require at least three constants. In addition, there is an $a$
constant, represented by the mean of the $z''$'s. Altogether, then, it would
probably take eight constants to fit mathematical curves to the regression
functions graphically determined. Accordingly, $n = 18$ and $m = 8$. With
these values, we can now compute $\bar{S}_{1.234}$ by equation (15.6).

$$\bar{S}^2_{1.234} = \frac{18(2.88)^2}{18 - 8} = 14.93, \qquad \bar{S}_{1.234} = 3.86$$

Similarly, by equation (15.7)

$$I^2_{1.234} = 1 - \frac{(2.88)^2}{(7.19)^2} = 0.8396, \qquad I_{1.234} = 0.916$$



The index of multiple correlation, 0.92, is moderately high. The
adjusted standard error of estimate, 3.86, indicates that if it were possible
to measure this same relationship from a very large sample drawn from
the same universe, the errors in estimating steel costs for the observations
in that large sample would probably have a standard deviation on the
order of 3.86, rather than 2.88, per ton.

Estimating Cost for New Observations. We can now use the data
for 1938 and 1939, which we have disregarded to this point, to work out
estimates for those years from the regression curves, by the same process
that is shown in Table 16.2. The values are:

Year    X2     X3    f2(X2)  f3(X3)  f4(X4)  Est. X1    X1      z

1938   36.2   90.0   73.0    14.5    -4.3     83.2     80.5    -2.7
1939   60.7   89.7   63.1    14.2    -3.0     74.3     76.0     1.7

Just as in the similar example in Chapter 14, it is necessary to extrapolate
two of the regression curves beyond the base data in making this
estimate for subsequent years. In spite of the additional possibility of
error which this introduces, both of the new estimates show residuals no
larger than the standard error of estimate. This indicates that the changes
in steel costs during
these next two years were in general related to the same factors as during
earlier years and to about the same degree. (The student can check this
conclusion by adding these two new observations to the original data,
and re-analyzing the resulting sample of 20 observations.) If the trend or
other factors were extrapolated much further, or if a sudden change in the
conditions surrounding the industry were to occur, much larger errors of
estimation might be experienced (as did happen later due to great changes
in general price levels).
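
The estimates for the two new years follow the same additive rule as
Table 16.2 and can be checked as below. The curve readings and actual
costs are those tabulated above for 1938 and 1939; the dictionary layout,
the function name, and the comparison against the adjusted standard
error of 3.86 are illustrative choices.

```python
new_years = {
    1938: {"f2": 73.0, "f3": 14.5, "f4": -4.3, "actual": 80.5},
    1939: {"f2": 63.1, "f3": 14.2, "f4": -3.0, "actual": 76.0},
}

ADJUSTED_STANDARD_ERROR = 3.86  # from the computation by equation (15.6)


def estimate_and_residual(reading):
    """Sum the three curve readings and subtract from the actual cost."""
    estimate = round(reading["f2"] + reading["f3"] + reading["f4"], 1)
    residual = round(reading["actual"] - estimate, 1)
    return estimate, residual


results = {year: estimate_and_residual(r) for year, r in new_years.items()}
for year, (estimate, residual) in results.items():
    within = abs(residual) <= ADJUSTED_STANDARD_ERROR
    print(year, estimate, residual, within)
```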

See Chapter 20 for the application of sampling concepts and error formulas to
time series.

Restating Short-Cut Results for Publication. The same methods
described on pages 234 to 240 of Chapter 14 can be used with curves
obtained by the short-cut process, to prepare them for publication.
There is a shorter method, however, which takes advantage of the fact
that the curves obtained by the short-cut method are already in terms
of a net value of $X_1$, for one variable, plus adjustments to that value for
the other variables. All that is necessary is to determine the average
value of the final $z$'s and use this average as the $a$ constant. (In the
illustrative example just given, this average was only 0.07, and consequently
was ignored.) Then the final functions are determined as follows (for the
final curves of the illustrative problem):

$$f_2(X_2) = a + f_2''(X_2)$$

$$f_3(X_3) = f_3''(X_3)$$

$$f_4(X_4) = f_4''(X_4)$$

It is evident that, except for the slight adjustment of adding $a$ to the
first curve, these curves are the same as the final curves shown on Figures
16.5, 16.6, and 16.7.

Identifying “Joint” Relations by the Short-Cut Process. In some
problems the relation between the variables is such that the dependent
variable cannot be explained fully by a regression equation which adds
the regression of $X_1$ on variable $X_2$ to that on $X_3$, etc. Instead, in such
cases the relation is so complex that the net change in $X_1$ with given
changes in $X_2$ will vary with the associated values of $X_3$ or other variables.
This type of relationship, designated “joint correlation,” is discussed
subsequently (Chapter 21). Where such correlation is present, it will show
up in the process of examining the subgroups of observations in the first
steps of the short-cut process.

The following empirical data will serve to illustrate the occurrence
of joint correlation:¹¹

Observation
  number       X1     X2     X3     X4

     1        216      9      4      6
     2        160     10      8      2
     3        140      2      7     10
     4        264      4     11      6
     5         30      5      2      3
     6         56      7      1      8
     7          5      1      5      1
     8         16      2      2      4
     9         70      2      5      7
    10        126      7      6      3
    11        180     10      3      6
    12        280      5      7      8
    13        120      3      4     10
    14         25      1      5      5
    15        224      4      8      7
    16        120      6     10      2

¹¹ From Wilfred Malenbaum and John D. Black, The use of the short-cut graphic
method of multiple correlation, Quarterly Journal of Economics, Vol. LII, p. 97,
November, 1937.




The number of cases here is so small that it is difficult to eliminate the
effects of $X_3$ and $X_4$, to determine the first approximation to the $X_1 X_2$
relation. An approximate grouping can be made, however, by classifying
the observations into three groups, as follows:

1. Those with $X_3$ and $X_4$ both larger than their respective means.

2. Those with $X_3$ and $X_4$ both smaller than their respective means.

3. Those with $X_3$ and $X_4$ one above and one below their respective
means.



Fig. 16.8. Relation of $X_1$ to $X_2$, with observations classified on $X_3$ and $X_4$.
The regression of $X_1$ on $X_2$ appears to shift with the accompanying values
of $X_3$ and $X_4$.

This gives groupings with four observations (3, 4, 12, and 15) in the
first group, four (5, 7, 8, and 14) in the second, and eight (1, 2, 6, 9, 10, 11,
13, and 16) in the third. Plotting each of these groups of observations, and
drawing an approximate line through each, gives the results shown in
Figure 16.8.
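
The three-way grouping can be reproduced mechanically, as in the sketch
below. It takes the $X_3$ and $X_4$ values of the sixteen observations,
computes the two means, and assigns each observation to a group. Two
points are assumptions: the last two entries for observation 16 are read
here as 10 and 2, and an observation exactly at a mean would be placed in
the third group (no such case arises with these figures). The function
name is illustrative.

```python
from statistics import mean

# (observation number, X3, X4); observation 16 uses an assumed reading.
observations = [
    (1, 4, 6), (2, 8, 2), (3, 7, 10), (4, 11, 6),
    (5, 2, 3), (6, 1, 8), (7, 5, 1), (8, 2, 4),
    (9, 5, 7), (10, 6, 3), (11, 3, 6), (12, 7, 8),
    (13, 4, 10), (14, 5, 5), (15, 8, 7), (16, 10, 2),
]


def classify(rows):
    """Group observations by whether X3 and X4 lie above or below their means."""
    mean_x3 = mean(x3 for _, x3, _ in rows)
    mean_x4 = mean(x4 for _, _, x4 in rows)
    groups = {1: [], 2: [], 3: []}
    for number, x3, x4 in rows:
        if x3 > mean_x3 and x4 > mean_x4:
            groups[1].append(number)      # both above their means
        elif x3 < mean_x3 and x4 < mean_x4:
            groups[2].append(number)      # both below their means
        else:
            groups[3].append(number)      # one above, one below
    return groups


groups = classify(observations)
print(groups)
```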

This figure differs from those we have examined previously (such as
Figure 16.3) in that the relations as shown by the several subgroups do
not parallel one another at relatively constant distances, but instead
diverge sharply. It appears, therefore, that the relation of $X_1$ to $X_2$ depends
not only on the value of $X_2$ but also on the associated values of $X_3$ and $X_4$.

In this particular case the progressive nature of the relations shown
on Figure 16.8 might lead us to suspect that the relation, instead of being
an additive one, is a multiplying one. If that is the case, though it could
not be represented adequately by an equation of the type:

$$X_1 = f_2(X_2) + f_3(X_3) + f_4(X_4)$$

it still might be represented by:

$$X_1 = [f_2(X_2)][f_3(X_3)][f_4(X_4)]$$

If that is the case, it can be determined by using the relation:

$$\log X_1 = f_2(\log X_2) + f_3(\log X_3) + f_4(\log X_4)$$

We can test whether this is likely to give a satisfactory fit by replotting
Figure 16.8 on double logarithmic paper, or by plotting it on ordinary
paper, substituting the logarithms of $X_1$ and $X_2$ for the natural values.
Let us do the latter.

When that is done, the relations appear as shown in Figure 16.9. The
three lines, fitted roughly to the three sets of observations, now appear
more nearly parallel. In particular, the line of the upper group, which
in Figure 16.8 made almost a 60-degree angle with the line for the lower
group, is almost perfectly parallel to it in Figure 16.9. Apparently in this
example the problem can be handled satisfactorily by the usual short-cut
procedures, merely by transforming the variables from natural numbers to
logarithms.
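
The effect of the transformation on the slopes of the group lines can be
sketched numerically. The example below uses the upper group
(observations 3, 4, 12, and 15) and the lower group (observations 5, 7,
8, and 14) and compares the slope of $X_1$ on $X_2$ in natural numbers
with the slope in logarithms. The lines in the figures were drawn
freehand; the least-squares slopes used here are a stand-in assumed for
illustration, so the ratios indicate the pattern rather than the exact
angles on the charts.

```python
import math

# (X2, X1) pairs taken from the table of sixteen observations.
upper = [(2, 140), (4, 264), (5, 280), (4, 224)]  # observations 3, 4, 12, 15
lower = [(5, 30), (1, 5), (2, 16), (1, 25)]       # observations 5, 7, 8, 14


def slope(points):
    """Least-squares slope of the second coordinate on the first."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return sxy / sxx


def logged(points):
    """Replace both coordinates by their common logarithms."""
    return [(math.log10(x), math.log10(y)) for x, y in points]


natural_ratio = slope(upper) / slope(lower)
log_ratio = slope(logged(upper)) / slope(logged(lower))
print("ratio of slopes, natural numbers:", round(natural_ratio, 1))
print("ratio of slopes, logarithms:     ", round(log_ratio, 1))
```

A ratio near 1 means the two lines are nearly parallel. In natural
numbers the upper line is many times steeper than the lower; in
logarithms the two slopes are of similar size.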

Where this transformation, or other simple transformations, do not
serve to make the successive subgroups show approximately parallel
relations, the methods of Chapter 21 must be employed instead.

Fig. 16.9. When the logarithms of the data shown in Figure 16.8 are used, the
net regression of $X_1$ on $X_2$ is found to be about the same, regardless of the
accompanying values of $X_3$ and $X_4$.

Application of the Short-Cut Method to Large Samples. The
short-cut method might be applied to samples too large for plotting the
individual observations separately, by using a modification of the process
of subgrouping and averaging illustrated in Chapter 23. The subgroup
averages tabulated there, plotted in Figures 23.1 and 23.2, indicate quite
well the final slope of the net regression lines. That is because the
influence of the other independent variable was largely held constant by
the process of subclassifying. In the same way the lines of averages from
subgroups would tend to indicate the regression curves in problems where
curves were needed. With a sufficient number of observations, the first
approximation to each of the net regression curves might be obtained from
charts of subaverages similar to Figures 23.1 and 23.2 on pages 390-391.
These several first approximation curves could then be made the basis for
working out estimated values of $X_1$ and residuals. The process of
successive approximations could then be continued exactly as illustrated
in Chapter 14. Since the first approximation curves would approach
fairly near to the true net regressions, the number of approximations
required to obtain the same closeness of fit would usually be less than by
the earlier method.

Combination of Short-Cut Procedures and Mathematical Procedures.
Both the short-cut method of this chapter and the longer successive
approximation method of Chapter 14 depend on graphic methods in
arriving at the curves of best fit. Where especially high accuracy is
desired, the final slope of the several curves can be checked by least
squares, according to the methods set forth in Chapter 17 on pages 291 to
293.

Some investigators prefer to use the short-cut method to determine the
approximate shapes of each of the several net regression curves, and then
to determine the final net regressions by fitting algebraic curves capable of
representing those several shapes. The technique for fitting these
mathematical curves to several variables is also set forth in Chapter 14. If there
is a logical basis to support the curves employed, there is value in this
procedure. If the equations are simply selected empirically, however, the
mathematical curves have no more meaning than the graphic ones, for
the reasons already discussed fully in Chapter 6, except for greater ease
and certainty in determining confidence intervals for each regression. It
is true that anyone fitting the same set of mathematical curves to the same
data by the same method will get exactly the same result, to the fifth
decimal place in the values of the constants, if desired. Curves obtained
by different investigators by either graphic process, on the contrary, may
vary slightly from one to another. But the identical constants obtained
by the least-squares fit have only a fictitious accuracy, as compared with
their standard errors, or with the zone of uncertainty within which the
function can be determined from the given set of observations. Multiple
regression curves are dependable only with respect to this confidence zone,
rather than to the exact line (as explained in Chapter 17).

Summary 

Under certain conditions first approximations to multiple regression
lines or curves may be obtained directly from the original observations by a
graphic process based on the comparison of individual observations,
considering several variables simultaneously. This process eliminates the
necessity of computing linear regressions by arithmetical means. Further,
it substitutes graphic measurements for arithmetic calculations in correcting
these curves to their final shape by successive approximations. It requires
the researcher to examine his data more thoroughly and so to exercise
thought and care in working out the relations and in interpreting their
significance. Carefully used, it materially reduces the time required in
determining multiple regression curves.

REFERENCES

Bean, L. H., A simplified method of graphic curvilinear correlation, Jour. Amer. Stat.
Assoc., Vol. XXIV, pp. 386-397, December, 1929.

Waite, Warren C., Some characteristics of the graphic method of correlation, Jour.
Amer. Stat. Assoc., Vol. XXVII, pp. 68-70, March, 1932.

Ezekiel, Mordecai, Further remarks on the graphic method of correlation, Jour. Amer.
Stat. Assoc., Vol. XXVII, pp. 183-185, June, 1932.

Malenbaum, W., and J. D. Black, The use of the short-cut graphic method of multiple
correlation, Quart. Jour. Econ., Vol. LII, pp. 66-112, November, 1937.

Bean, L. H., and Mordecai Ezekiel, The use of the short-cut graphic method of multiple
correlation: Comment, and Further comment, Quart. Jour. Econ., Vol. LV,
pp. 318-346, February, 1940.

Wellman, H. R., Application and uses of the graphic method of multiple correlation,
Jour. Farm Econ., Vol. XXIII, pp. 311-316, February, 1941.

Waite, Warren C., Place of, and limitations to, the method, Jour. Farm Econ.,
Vol. XXIII, pp. 317-322, February, 1941.

Working, E. J., and Geoffrey Shepherd, Notes on the place of the graphic method of
correlation analysis, Jour. Farm Econ., Vol. XXIII, pp. 322-323, February, 1941.

Foote, Richard J., and J. Russell Ives, The relationship of the method of graphic
correlation to least squares, U.S. Dept. Agr., Bur. Agr. Econ., Stat. and Agr. No. 1,
April, 1941.

Foote, Richard J., The mathematical basis for the Bean Method of Graphic Multiple
Correlation, Jour. Amer. Stat. Assoc., pp. 778-788, December, 1953.


Note 16.1. These discussions, and an address by Meyer A. Girshick summarized in
the February, 1941, Journal of Farm Economics, have provided definite proof of the
meaning of the graphic method. In linear multiple correlation the graphic method gives
results which tend to approach the lines secured by a least-squares solution, even if the
first approximations are purely arbitrary guesses, while the speed of convergence
depends on the intercorrelation among the independent variables. The higher their
intercorrelation, the slower tends to be the speed of the convergence.

Note 16.2. If the standard error of estimate, adjusted for the degrees of freedom,
$\bar{S}_{1.23 \ldots k}$, is calculated as each new set of approximation curves is completed, it will
show whether the gain in closeness of fit is sufficient to offset any additional flexibility
introduced in the curves. The validity of this test, however, depends upon the user's
skill in estimating what value of $m$ to employ.



SECTION V


Significance of
Correlation and
Regression Results


CHAPTER 17


The sampling significance of
correlation and regression measures

Early in this book it was pointed out that when any statistical measure,
such as an average, is determined from a sample selected from a universe
under study, the true value of that measure in the universe might be
different from the value shown by the sample. Methods were discussed
which enable one to estimate how far the average from such a sample
may vary from the true average, for a stated proportion of such samples.
Such estimates enable one to judge how much confidence may be placed
in an average calculated from a given sample.

Different Types of Sampling Models. The applicability of sampling
concepts to correlation coefficients differs widely according to the nature
of the universe from which the sample is selected and the manner in
which sample values of the independent variables are obtained. The
interpretation of regression coefficients is much less dependent upon the
shape of the underlying universe (if any) or the manner in which sample
values of the independent variables are chosen; however, estimates of
the reliability of regression coefficients are influenced by these factors.
For convenience of exposition we shall distinguish between two principal
situations or “models”: the correlation model and the regression model.¹

¹ Compare M. G. Kendall, Regression, structure and functional relationship, Part I,
Biometrika, Vol. 38, pp. 11-25, June, 1951.

Kendall distinguishes between (1) situations in which values of the independent



280 Sampling Significance of Results 

The correlation model requires strictly random samples from normal
bivariate or multivariate universes. This means that the joint frequency
distribution of the two or more variables in the sample will be representative
of the corresponding distribution in the universe; that the distribution
of each variable will tend to follow the normal frequency curve; and
further, that the standard deviations of $Y$ values for all values or class
intervals of $X$ will be the same within normal sampling fluctuations.²
(If, as in the auto-stopping example of Chapters 3, 4, and 6, the $s_y$
increases, both in the universe and in the sample, with increasing values
of $X$, and the differences in $s_y$ in different arrays can be corrected by using
some transformation of $Y$ such as, in this case, log $Y$, that is sufficient
to satisfy the condition, but only if the transformed values are used in
calculating the correlation coefficients.) For samples which meet the stated
conditions, standard errors may be calculated for both correlation and
regression statistics derived from the sample which will provide a sound
basis for probability statements as to their nearness to the corresponding
parameters for the universe.

The term regression model will be used here to imply that values of the
independent variable or variables are selected in advance by the investigator,
as is typical in controlled experiments, with no requirement that

variables are decided before the sample is drawn (the determined case), and (2) situations
in which the values of the independent variables are randomly drawn from an underlying
multivariate distribution (the undetermined case). Kendall's determined case
corresponds to the regression model of this chapter. Kendall's undetermined case is
somewhat broader than our correlation model, for the underlying multivariate
distributions assumed by Kendall do not necessarily follow the normal curve.

Thus, in Kendall's undetermined case, both regression and correlation coefficients in
a sample would be regarded as estimates of the corresponding parameters of an underlying
universe which exists in nature or at least independently of any intentional influence
on the part of the investigator who has drawn the sample. The standard errors of the
regression coefficients have the same interpretation in this case as in both Kendall's
determined case and in the undetermined normal multivariate distribution of our
correlation model. However, as noted by Kendall in another place, very little is known
about the sampling distribution of the correlation coefficient except in the case of normal
universes. (See M. G. Kendall, The Advanced Theory of Statistics, Vol. 1, p. 346,
Charles Griffin, London, 1943.) Hence, the interpretation of the correlation coefficient
is uncertain unless both conditions of our correlation model are met, namely that the
sample values of the independent variables be drawn randomly and that the underlying
universe be normal.

See also George W. Snedecor, Statistical Methods Applied to Experiments in Agriculture
and Biology, 5th ed., pp. 413-14, Iowa State College Press, Ames, Iowa, 1956,
for a distinction between models which is substantially identical with that used in the
present chapter.

² Similarly, sample values of $X$ will tend to be normally distributed with approximately
equal standard deviations for all values or class intervals of $Y$.




For Measures of Regression and Correlation 281

the distributions of the independent variables in the sample will be
representative of those in the universe, if indeed a “natural” underlying
universe exists at all. In this model measures of correlation have only a
very limited meaning, and estimates of the reliability of regression coefficients
apply only to a distribution of samples each drawn or constructed
under exactly the same conditions as the given sample, including the
same range and relative frequency of values of the independent variables.


Reliability of Regression Coefficients 

Simple Regression. We have already noted that estimates of regression
coefficients are less affected by differences in model than are estimates of
correlation coefficients. For both models, the reliability of the observed
simple regression coefficient varies inversely with the standard deviation
of the individual observations around the regression line, and directly
with the square root of the number of cases in the sample. The standard
deviation of regression coefficients from one sample to another can be
estimated by the equation³

$$s_{b_{yx}} = \frac{\bar{S}_{y.x}}{s_x \sqrt{n}} \qquad (17.1)$$

The application of this equation may be illustrated by data drawn from
a sampling experiment using a universe of known characteristics, and
with normal distributions of the variables. The first sample, of 50 cases,
gave the value $b_{yx} = 0.175$, with $\bar{S}_{y.x} = 2.46$, and $s_x = 2.44$. Computing
the value of $s_{b_{yx}}$ by equation (17.1):

$$s_{b_{yx}} = \frac{2.46}{2.44\sqrt{50}} = \frac{2.46}{17.25} = 0.143$$

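The computation by equation (17.1) can be repeated for any sample as
in the sketch below. The form of the equation is the one implied by the
worked figures (the standard error of estimate divided by the product of
the standard deviation of $X$ and the square root of $n$); the function
and argument names are illustrative.

```python
import math


def se_regression_coefficient(s_yx, s_x, n):
    """Standard error of a simple regression coefficient.

    s_yx: standard error of estimate around the regression line
    s_x:  standard deviation of the independent variable
    n:    number of cases in the sample
    """
    return s_yx / (s_x * math.sqrt(n))


s_b = se_regression_coefficient(s_yx=2.46, s_x=2.44, n=50)
print(round(s_b, 3))  # 0.143
```

Quadrupling `n` halves the result, which is the square-root dependence on
sample size noted in the text.
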
The variable i = {b — ^)jst, follows a “Student's” or .'-distribution 
which, as noted in Chapter 2, approaches a normal distribution as the 
number of degrees of freedom in the sample estimate becomes large. In 
this expression /9 is the true (but unknown) regression coefficient in the 
universe, and b and are values computed from a particular sample. 
The values of t from repeated samples are distributed symmetrically 
about zero even if the sample size is very small. Conventionally, proba- 
bility statements are based on the normal distribution sslicn the number 

* In some textbooks, $b_{yx}$ would be used to represent the regression coefficient
determined from the sample and $\beta_{yx}$ would be used to represent the true
parameter in the universe. In this notation, the value for $\beta_{yx}$ as shown in
Table 17.1 is 0.152. In consulting textbooks using this notation, we should not
confuse this use of the $\beta$ with the special definition given for it in
Chapters 9 and 12, in equations (9.1) and (12.9).



282 Sampling Significance of Results 

of degrees of freedom in the sample estimate exceeds 30. If the number
is equal to or only slightly more than 30, statements based on the normal
distribution are quite accurate for t-values between zero and ±2 but
tend to exaggerate the level of significance actually attained for extreme
values of t (less than -3 or more than +3). Given sample estimates with
30 or more degrees of freedom, in 2 samples out of 3, on the average, the
observed regression coefficient will lie within one standard error of the
true parameter. If in this case we say that the universe regression
coefficient lies within the interval 0.175 ± 0.143, or between 0.032 and
0.318, we are making a statement of a type which, if made for a succession
of such samples, will be wrong 1 time out of 3, on the average. Parallel
statements for intervals of $\pm 2s_{b_{yx}}$ or $\pm 3s_{b_{yx}}$ are
subject to the same interpretation as that given earlier, and to the same
adjustments for small samples given in Table 2.3. However, n - (m - 1) must
be used in entering data in that table, instead of n, in order to allow for
the additional degrees of freedom used up in determining the regression
coefficients.

We can now illustrate the meaning of the estimated standard error for
our one sample by comparing it with the other regression coefficients
obtained in the sampling experiment mentioned. In that experiment,
5 samples were drawn at random from the universe for each of 3 sizes
of samples: 30, 50, and 100 observations. The values obtained are shown
in Table 17.1.


Table 17.1

Values of $b_{yx}$ Shown by Successive Samples Drawn from the
Same Universe, with Different Sizes of Samples

                      30              50              100
                 Observations    Observations    Observations

                     0.292           0.175           0.113
                     0.012          -0.297           0.120
                    -0.136           0.144           0.303
                    -0.022           0.130           0.197
                     0.449           0.167           0.132

Universe value       0.152           0.152           0.152


In this case, we see that the value for the universe, 0.152, lay within our
confidence interval for P = 0.67. Also, of the 5 samples with n = 50,
4 had values of $b_{yx}$ within the interval from our sample. If our sample
had happened to be the second one, however, its $b_{yx}$, -0.297, would
probably have differed from the universe value by more than twice its
own standard error. The values in Table 17.1 also illustrate how the
variation in successive sample values declines as the size of the sample
is made larger. We could estimate the expected decline, of course, by
recalculating equation (17.1) using 30 and 100 for n and then comparing
the estimated variation for samples of 30, 50, and 100 with the values
shown in the table. (This is left as an exercise for the student.)
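
One way the exercise might be set up is sketched below in Python. It is only an illustration under stated assumptions: the values 2.46 and 2.44 from the first 50-case sample are reused for every sample size, the function and variable names are ours, and the "observed spread" is simply the standard deviation of the five coefficients listed in Table 17.1 for each size, which is a rough measure with so few samples.

```python
import math
import statistics


def se_slope(s_yx, s_x, n):
    """Equation (17.1): estimated standard error of a simple regression coefficient."""
    return s_yx / (s_x * math.sqrt(n))


# Values reported for the first 50-case sample; reusing them for every
# sample size is an assumption made only for this illustration.
S_YX, S_X = 2.46, 2.44

# Sample regression coefficients from Table 17.1, keyed by sample size.
TABLE_17_1 = {
    30: [0.292, 0.012, -0.136, -0.022, 0.449],
    50: [0.175, -0.297, 0.144, 0.130, 0.167],
    100: [0.113, 0.120, 0.303, 0.197, 0.132],
}


def compare():
    """Return (n, estimated standard error, observed spread) for each sample size."""
    rows = []
    for n, coefficients in TABLE_17_1.items():
        estimated = se_slope(S_YX, S_X, n)
        observed = statistics.stdev(coefficients)
        rows.append((n, estimated, observed))
    return rows


if __name__ == "__main__":
    for n, estimated, observed in compare():
        print(f"n={n:3d}  estimated s_b={estimated:.3f}  observed spread={observed:.3f}")
```

With only five samples of each size, close agreement between the two columns should not be expected; the point of the comparison is the direction of the decline as n increases.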

It will be noted from equation (17.1) that $s_{b_{yx}}$ varies inversely with
$s_x$, the standard deviation of values of the independent variable in the
sample. In the correlation model, $s_x$ is an estimate of the true parameter
$\sigma_x$ in the underlying bivariate normal universe. (If strictly random
samples were drawn from a "natural" universe which followed other than a
normal distribution, $s_x$ would still be an estimate of the parameter
$\sigma_x$ in that universe.) The regression model implies that values of X in
successive samples are controlled or selected on the same principles as they
were for the original sample. The estimated sampling variance of b is
therefore an unbiased estimate of the variance of b's from such a set of
successive samples. The first three pages of Chapter 18 give some arithmetic
examples illustrating the principle that such a selection of X values does
not change the value of b. Accordingly, if ordinary 95 per cent confidence
intervals are used, the true regression coefficient in the universe is
covered in 19 out of 20 samples, just as in the case of similar confidence
intervals for universe means. However, as is also demonstrated in Chapter 18,
purposeful selection of extreme values of X can reduce the standard error of
b very substantially relative to the value that would be obtained if the X
values were selected so as to follow a normal frequency curve. Thus,
for a given sample size, $s_b$ may show much wider variations from one
artificial universe to another than from sample to sample drawn from the
same universe.

Net (Multiple) Regression. The standard error of a net regression
coefficient may be estimated by the equation*

$$s_{b_{12.34}} = \frac{\bar{S}_{1.234}}{s_2\sqrt{n\,(1 - R^2_{2.34})}} \qquad (17.2)$$
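
The effect of intercorrelation among the independent variables on this standard error can be seen in a small numerical sketch. The Python below assumes that equation (17.2) has the form: standard error of estimate divided by the product of the standard deviation of the given independent variable and the square root of n(1 - R²), where R² is the squared multiple correlation of that variable with the other independent variables. The function name and all of the input values are hypothetical and are chosen only for illustration.

```python
import math


def se_net(s_est, s_j, n, r2_j):
    """Standard error of a net regression coefficient, in the form assumed for (17.2).

    s_est: standard error of estimate for the whole equation
    s_j:   standard deviation of the given independent variable
    n:     number of cases in the sample
    r2_j:  squared multiple correlation of that variable with the
           other independent variables
    """
    if not 0.0 <= r2_j < 1.0:
        raise ValueError("r2_j must lie in the range 0 <= r2_j < 1")
    return s_est / (s_j * math.sqrt(n * (1.0 - r2_j)))


if __name__ == "__main__":
    # Hypothetical inputs: only r2_j is varied, to show how the standard
    # error grows as the independent variables become more closely related.
    for r2 in (0.0, 0.5, 0.75, 0.9):
        print(f"R^2 = {r2:4.2f}   standard error = {se_net(2.0, 2.5, 30, r2):.3f}")
```

Under this form, a squared intercorrelation of 0.75 doubles the standard error relative to the case of no intercorrelation, other things being equal.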


As with simple regression coefficients, the reliability of net regression
coefficients is affected by the number of cases in the sample and the
standard error of estimate. In addition, it is affected by how closely the
given independent variable [$X_2$ as equation (17.2) is stated] can be estimated
from the other independent variables (as $X_3$ and $X_4$). The more highly

* The standard errors of net regression coefficients can be determined at the same
time as the regression coefficients, and as part of the same set of computations, by
various modifications of this formula. See Appendix 2, pages 40-? to 502, and 50/ to 516.




the independent variables are interrelated among themselves, the less
reliably can the net regression of $X_1$ upon any one of them be determined.
Again, we can illustrate the use of the equation by results from a
sampling study. In this case 3 independent variables were used, and
successive samples were drawn, 16 of 30 observations, 10 of 50, and 5 of
100. The first sample drawn of 30 observations gave values as follows:
$b_{12.34} = 0.583$, $b_{13.24} = 0.366$, $b_{14.23} = 0.949$, Siai = 2U, $s_2 = 2.53$,
and i ?2 31 = 0 ^08. Substituting the appropriate values in equation (17.2),

Table 17.2

Distribution of Values for Net Regression Coefficients for Repeated
Samples Drawn from the Same Universe

                           30             50             100          True
Range of Values       Observations   Observations   Observations     Value

Values for $b_{12.34}$
  -0.79 to -0.60            1
  -0.59 to -0.40            0
  -0.39 to -0.20            1
  -0.19 to -0               0              2
   0    to  0.19            2              1              1
   0.20 to  0.39            6              4              4          +0.320
   0.40 to  0.59            4              3
   0.60 to  0.79            2

Values for $b_{13.24}$
  -0.19 to  0               2
   0    to  0.19            3
   0.20 to  0.39            5              6              2          +0.377
   0.40 to  0.59            2              2              2
   0.60 to  0.79            1              1              1
   0.80 to  0.99            2              1
   1.00 to  1.19            1

Values for $b_{14.23}$
   0    to  0.19
   0.20 to  0.39                           1
   0.40 to  0.59            1              1
   0.60 to  0.79            8              2              2
   0.80 to  0.99            3              4              3          +0.824
   1.00 to  1.19            4              1
   1.20 to  1.39                           1
   1.40 to  1.59




the value of $s_{b_{12.34}}$ is found to be 0.287. The observed regression may
therefore be written as 0.583 ± 0.287. If this sample value departs from
the true value no more than can be expected in 2 cases out of 3, the true
value will lie within the confidence interval 0.296 to 0.870; or, if it departs
no more than can be expected in 19 cases out of 20, the true value will lie
in the interval from 0.009 to 1.157.

Table 17.2 shows the distribution of the values obtained from the various
sets of samples, as compared to the true parameters. The true value for
$\beta_{12.34}$ is 0.320, or within the first interval just given. It may be
noted that 11 of the 16 samples of size 30 gave values for this coefficient
within 0.287 of the true value, and all but one fell within 0.574 of it.
Again this illustrates how the variability of statistics which tend to be
distributed according to either normal or t-distributions may be estimated by
appropriate error formulas, hence how the degree of confidence to be placed in
conclusions from a given sample may be judged.
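
The two intervals just quoted can be checked with a few lines of arithmetic. The Python sketch below uses only the figures given in the text (0.583, 0.287, and the true value 0.320); the function name is ours, and the multipliers 1 and 2 are the conventional large-sample approximations to "2 cases out of 3" and "19 cases out of 20", not exact small-sample values.

```python
def interval(b, se, k):
    """Return the interval b - k*se to b + k*se as a (low, high) pair."""
    return (b - k * se, b + k * se)


B_12_34 = 0.583      # observed net regression coefficient (from the text)
SE_12_34 = 0.287     # its estimated standard error (from the text)
TRUE_VALUE = 0.320   # universe value shown in Table 17.2

if __name__ == "__main__":
    for k, label in ((1, "about 2 cases out of 3"), (2, "about 19 cases out of 20")):
        low, high = interval(B_12_34, SE_12_34, k)
        covered = low <= TRUE_VALUE <= high
        print(f"{label}: {low:.3f} to {high:.3f}  (covers true value: {covered})")
```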

The qualifications that use of this error formula may impose upon
regression results may be illustrated by a problem in which the theory
of sampling was reasonably applicable, namely, the relation between the
average amount of feed a herd of cows receives and the resulting milk
production per cow. Table 17.3 shows these results for two different
studies, regarding each set of observations as a sample that was random
with respect at least to the dependent variable.

This table illustrates two points: first, that the net regressions are not
very accurate even though the multiple correlation is 0.80 to 0.86; and
second, that the reliability of the net regression differs from variable
to variable, being much greater for some variables than for others. It
is obvious that some of the net regression coefficients are not at all
statistically significant, whereas others indicate the probable relationship
within a fairly narrow range.

Thus for the percentage of lime, with the standard error as large as the
regression coefficient, there is 1 chance in 3 that the sample net regression
coefficient differs from the universe value by more than the sample
regression itself, and 1 chance in 6 that the sample regression is of opposite
sign from that in the universe. With the total digestible nutrients, on the
other hand, with the standard error only 18 per cent of the observed
value, there is but little chance that the observed value differs from the
universe regression by more than 36 per cent, and very little chance that
it differs as much as 50 per cent.
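
These chances follow from the normal curve, and a short calculation makes the reasoning explicit. The Python sketch below assumes that the sampling distribution of the coefficient is normal, which is reasonable for samples of 77 to 95 cases; the function names are ours, and the percentages are those of Table 17.3.

```python
import math


def normal_tail(z):
    """Chance that a standard normal deviate exceeds z (one tail only)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def chance_beyond(se_percent, departure_percent):
    """Two-sided chance that the sample coefficient differs from the universe
    value by more than departure_percent of the observed coefficient, when the
    standard error is se_percent of the observed coefficient."""
    return 2.0 * normal_tail(departure_percent / se_percent)


if __name__ == "__main__":
    # Per cent of lime: standard error 99.9 per cent of the coefficient.
    print("lime, differs by more than the coefficient itself:",
          round(chance_beyond(99.9, 100.0), 3))
    print("lime, opposite sign in the universe:",
          round(normal_tail(100.0 / 99.9), 3))
    # Total digestible nutrients: standard error 17.8 per cent.
    print("nutrients, differs by more than 36 per cent:",
          round(chance_beyond(17.8, 36.0), 3))
    print("nutrients, differs by more than 50 per cent:",
          round(chance_beyond(17.8, 50.0), 3))
```

The first two results come out close to 1 in 3 and 1 in 6, as stated; the last two are small fractions, in keeping with "little chance" and "very little chance".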

If the regression equation is to be used solely as a basis for making
new estimates of the value of the dependent factor to be expected for
given values of the independent factors, then the accuracy of the several
net regression coefficients does not make such a great difference. Any





Table 17.3

Standard Errors of Partial Regression Coefficients, in Per Cent of
the Value of the Coefficient*

Item                                        Wisconsin        Minnesota
                                              Study            Study

Number of observations                          95               77
Number of variables                             10                8
Multiple correlation, adjusted for
  number of variables                     0.805 ± 0.039    0.862 ± 0.034

Standard Error of Regression Coefficients†

Independent variable
  Total digestible nutrients                   17.8%            17.0%
  Nutritive ratio                              18.4%            Ul%
  Per cent of protein "good"                   42.0%
  Per cent of lime                             99.9%
  Per cent summer feeding                      25.9%
  Per cent silage                              31.9%            20.3%
  Fat test of milk                             15.7%             5.5%
  Per cent fall freshening                     27.4%            17.5%
  Value per cow                                39.7%
  Age of cows                                                   26.5%
  Per cent grain in ration                                      29.9%

* Mordecai Ezekiel, The application of the theory of error to multiple and curvilinear
correlation, Journal of the American Statistical Association, Vol. XXIV, No. 165A,
March, 1929, Supplement, p. 103.

† The coefficients are for the net regression of milk production on the factors stated.
The original article gave the per cent figures in terms of the "probable error" (P.E. =
0.6745 of the standard error).

deficiency in one may be compensated for by an excess in another. (This
does not hold true, however, if estimates are made for extreme values of
variables whose regressions are subject to large errors. See Chapter 19
on this point.) But if the major interest is not in the total estimate, but
in the changes in the dependent factor with changes in each particular
independent factor, then the reliability of each particular regression
coefficient becomes of real importance. In the illustration cited, for example,
it would not do to know merely that the milk production per cow varied
both with protein content and with lime, if it was desired to know how
much to allow for protein and how much for lime in compounding a
ration. Instead, the standard errors indicate that the influence of protein






(as represented in the "nutritive" ratio) has been fairly accurately measured,
whereas the influence of lime has not been accurately measured at all.
Not much confidence therefore can be placed in the conclusions as to
this latter factor.

In any regression study where the results are based upon a sample of
observations drawn at random from a known universe, and where any
importance is to be attached to the values found for the several regression
coefficients, it is essential that the standard errors of each of those
coefficients be determined and considered. As is illustrated in the examples
just discussed, a sample may have a high multiple correlation and yet
yield regression coefficients for some variables which are almost entirely
the result of chance fluctuation, and therefore are not statistically
significant. This may occur even with moderately large samples, such as the
sample of 95 cases in the first example just considered. Computation,
presentation, and discussion of the standard errors of the regression
coefficients are therefore vital parts of any such multiple correlation study.

Regression Line. Not only may the observed slope of the regression
line vary from the true slope, but the elevation of the line, as observed
from a sample, may vary from the true elevation. Equation (17.1) has
already indicated a way of determining the standard error of the regression
coefficient, and so of estimating the probable range within which the true
slope lies. The height of the regression line is most accurately determined
for the mean estimated value of the dependent factor, corresponding
to the observed mean value of X, the independent factor. If we define
the mean as

$$M_{y'} = a + b_{yx}M_x$$

we may find its standard error by the formula

$$s_{M_{y'}} = \frac{\bar{S}_{y.x}}{\sqrt{n}} \qquad (17.3)$$


The standard error of the whole regression line may now be determined
from equations (17.1) and (17.3). We may illustrate by data from the
cotton-yield problem used as an example in Chapter 8. With 14 observations,
the values were $b_{yx} = 16.70$, $a = -2.261$, $M_x = 1.97$,
$\bar{S}_{y.x} = 8.28$, $s_x = 0.73$, $M_y = M_{y'} = 30.64$, $s_y = 14.43$.

$$M_{y'} = -2.261 + (16.70)(1.97) = 30.64$$

$$s_{M_{y'}} = \frac{8.28}{\sqrt{14}} = 2.21$$

$$s_{b_{yx}} = \frac{8.28}{0.73\sqrt{14}} = 3.03$$




Since the estimated value $Y'$ equals $M_{y'} + b(x)$, the standard error of
the estimate for any value of x will include the standard errors of $M_{y'}$
and of $b(x)$. Standard errors are standard deviations, hence they should be
summed by adding their squares. The standard error of $Y'$, for any
particular value of x, is therefore given by the equation*

$$s_{Y'}^2 = s_{M_{y'}}^2 + (s_b\,x)^2 \qquad (17.4)$$

By using this relation, the calculation of the standard error of $Y'$,
for selected values of X, is shown in Table 17.4.


Table 17.4

Selected   Departures                Calculation of $s_{Y'}$
Values     from         -----------------------------------------------------------------
of X       Mean, x      $s_b x$     $(s_b x)^2$   $s_{M_{y'}}^2$   $s_{M_{y'}}^2$         $s_{Y'}$
                        = 3.03x                   = 2.21^2         + $(s_b x)^2$

0.97        -1.00       -3.030       9.1809       4.8841          14.0650             3.75
1.47        -0.50       -1.515       2.2952       4.8841           7.1793             2.68
1.97         0           0           0            4.8841           4.8841             2.21
2.47         0.50        1.515       2.2952       4.8841           7.1793             2.68
2.97         1.00        3.030       9.1809       4.8841          14.0650             3.75
3.47         1.50        4.545      20.6570       4.8841          25.5411             5.05
3.97         2.00        6.060      36.7236       4.8841          41.6077             6.45

There are 14 cases; subtracting the one extra constant involved in
correlation determinations gives 13 as the number of observations with
which to judge from Table 2.3 the significance of these standard errors.
Taking values midway between those for 10 and for 16 cases, we find that
a statement that the true values of the mean and of the regression
coefficient do not differ from the observed values by more than the
calculated standard errors will be wrong for 34 out of each 100 such
statements, on the average. Similarly, the statement that they do not differ
by more than twice the calculated standard errors will be wrong for 7 out
of 100 such statements, on the average. The chances are therefore 93 out of
100 that the departure of this regression line from the true line will not
be larger than the confidence intervals just calculated. Plotting
$2s_{Y'}$ above and below the corresponding values

* Holbrook Working and Harold Hotelling, Applications of the theory of error to the
interpretation of trends, Journal of the American Statistical Association, Papers and
Proceedings, Vol. XXIV, pp. 73-85, March supplement, 1929.






of $Y'$, given by the regression line, shows this interval. These limits are
plotted in Figure 17.1, together with the original observations and the
regression line. The limits within which the universe regression "probably"
lies could be shown in a similar manner for any other desired level of
probability. As is evident in the figure, the probable true position of the
line becomes very uncertain as the limits of the data are approached.



Fig. 17.1. Linear regression of cotton yield on irrigation water applied,
and range within which the true relation probably lies.

In most studies of statistical relationships between variables the
regression line is the most important result of the study. The confidence that
can be placed in the line determined from a random sample is no greater
than is indicated by the probable error of its slope, or the standard error
zone of its position. Accordingly, the final statement of the regression
equation should always indicate clearly the standard errors of the net
regression coefficients and the number of observations on which the
conclusions are based. This will serve to caution the reader of the extent
to which the values may vary from the true value simply due to chance
fluctuations of sampling, and so caution him not to attach more importance
to them than their significance justifies. It is not customary in most fields





to tabulate or chart values of the standard error zone of the regression
line in published reports. However, the investigator should at least
calculate this error zone for a few representative sets of values of the
independent variables for his own guidance and reflection, and it would
be appropriate to pass on to his readers at least a general statement as to
the reliability of the regression line in a few broad ranges of values of the
independent variables.

Regression Curves Fitted Mathematically. Where regression curves
are obtained by fitting definite mathematical equations to the data, the
standard error of the curve may be judged by the same methods previously
presented for determining the standard errors of net regression coefficients.
Thus, if a parabola of the formula

$$Y = a + bX + b'X^2$$

is determined, the standard errors of $b$ and $b'$ may be determined by
equations (17.2) and (17.4), treating $X$ and $X^2$ as two independent
variables. This would involve computing the standard error zone of
$f(X)$ by the equation

sf , ,, = + (,^)2 + XjU + (s, u)‘ (17 5) 

n 

where $U = X^2$, and $u = X^2 - M_{X^2}$.

Probability statements concerning the range within which the true
curve probably lies may then be formulated just as has been illustrated
for a linear regression. Similarly, if net regression curves are determined
by fitting mathematical equations in three or more variables, an extension
of this same method may be used to judge the reliability of each of the
net regression curves so obtained.*

Regression Curves Determined Graphically. For regression curves,
either simple or multiple, determined by graphic processes, no such exact
mathematical estimation of the standard error intervals is possible.
Experimental studies summarized in earlier editions of this book have given
some indication of the range of sampling errors in such curves, and how
that may be estimated, but these approximations to the standard-error
zone are not as reliable as the equations shown thus far for algebraically
determined curves, and more work is needed on this problem.† The
experimental work did show that the reliability of graphic curves varies
inversely with the standard error of estimate for the whole sample, and
apparently tended to vary inversely with the intercorrelation among the
* Henry Schultz, The standard error of a forecast from a curve, Journal of the American
Statistical Association, Vol. XXV, pp. 139-185, June, 1930.
† See the second edition of this book, pages 327-39.




independent variables, just as is the case with net regression coefficients;
but that in addition the reliability of the results varied with the thickness
of observations along the regression curve for each independent variable,
being more reliable where observations were reasonably frequent, and
less reliable where the observations thinned out. Now that electronic
calculators make elaborate computations much less burdensome, in
cases where graphic curves have been fitted for lack of any theoretical
reason to expect a given type of equation, the reliability of the regressions
could be determined by the following process: select several sets of
equations that could be reasonably expected to approximate the graphic
curves, and fit them simultaneously by the method explained on pages
205 to 210; after determining the set that came nearest to reproducing the
graphic curves with the smallest number of constants, determine the
standard-error zone of each curve by the methods indicated in the preceding
section. This effort would be worth making, however, only if it was
desired to make inferences from the sample as to the universe; or if it
was important to determine which net regression curves had been
determined with so little significance that they should be excluded from the
solution.

An alternative estimate of the reliability of graphic regression curves
may be made by determining the regression equation (15.10):

$$X_1 = a_{1.2'3'4'} + b_{12'.3'4'}[f_2(X_2)] + b_{13'.2'4'}[f_3(X_3)] + b_{14'.2'3'}[f_4(X_4)]$$


To compute the new constants required in this equation, the functional
readings corresponding to the independent variables are correlated with
the original values of the dependent variable. Thus, in Table 14.15, the
values read from the final curves, shown in the fourth, fifth, and sixth
columns, would be substituted for the original independent variables in
running the multiple correlation with $X_1$. If $X_2'$, $X_3'$, etc., are used to
represent these transformed values, the data to be correlated for the first
four observations would be:


AN a:; 

ft'.' 

•» 

V >'' Vf 

Ai -‘J ,'3 

A-; 

7.4 11.7 

12.3 

24.5 8.4 J2.2 

12.2 

7.9 13.0 

11.8 

33.7 8.8 9.9 

12.2 

If the net regression coefficients come out 1.0, that indicates no
change need be made in the curves. If any b comes out other than unity,
however, the values read from the corresponding curve should be adjusted
as indicated by the regression results. The adjustment may be worked
out as follows:




In the same way that $f_2(X_2)$ was used to indicate the values read from
the final set of approximation curves, let $f_2(x_2)$ represent the deviations
of those readings for each variable from the average of all the readings
for the particular variable. That is, for each observation

$$f_2(x_2) = f_2(X_2) - M_{f_2(X_2)}$$

The regression equation (15.10) may then be restated

Xj = bi2 + ^13 ^11 

and the corrected functions will be as follows 

/=(**) - ^12 34r;T(^a)] 

M^a) - *13 2<in(^a)] 

=*14 23 If* (* 4 )] 

The correction merely expands or contracts the curve, making all the
high estimated values higher and all the low values lower, or vice versa.

When equation (15.10) is computed for the values in Table 16.2, the
following multiple regression equation is obtained:

$$X_1''' = -6.1099 + \underset{(0.1509)}{1.0926}\,[f_2'''(X_2)] + \underset{(0.3601)}{1.1257}\,[f_3'''(X_3)] + \underset{(0.1982)}{1.1094}\,[f_4'''(X_4)]$$

As expected, all the b's come out close to 1.00, so only small change
in the shape or slope of the curves would result from applying these
corrections. The adjusted standard error of estimate for this new set of
regressions, and the index of multiple correlation as shown by the R
for the multiple regression equation, are practically unchanged from those
calculated in Chapter 16, indicating that the slight shift of the curves has
made only slight improvement in the over-all fit to the data, and that there
is no great significance in the final changes in their shape. The fact that

indicates that all three regression curves are significant 

Equation (15.10) thus provides a partial answer to the question of
estimating the reliability of curvilinear net regressions obtained by graphic
methods. It is only a partial answer, as more precise mathematical
approximations to each net regression curve would use up two or more
degrees of freedom and would require us to estimate two or more constants
which determine the shape of each curve. The standard errors for
equation (15.10) do not tell us whether or not the best or most
logical shapes of $f_2(X_2)$, etc., have been attained; they simply show that
the particular approximations to these shapes represented by $f_2'''(X_2)$,
etc., do have a statistically significant association with $X_1$. (Other
alternative ways of adjusting the final graphic regression curves by
supplementary calculations, presented on pages 400 to 404 of the second
edition, are omitted from this edition for lack of space.)


Reliability of Correlation Coefficients 

Correlation Coefficients. For large random samples drawn from normal
bivariate universes in which the true correlation $\rho$ is not far from zero,
the distribution of the observed correlations in successive samples will
tend to be nearly normal, and the standard error of the coefficient may
be approximately estimated by the formula




'VI 


(17.6) 


If the sample is small, however, no simple equation will be adequate.
Further, if the true value of the correlation is much smaller or larger than
zero, the distribution of sample values will be badly skewed, as will be
evident intuitively if we consider the probable distribution of values in a
range from -1 to 1 with the mode (say) at 0.75. The reliability of the
observed correlation coefficient must therefore be estimated by other
methods than equation (17.6) in most cases.

Certain of Fisher's methods for determining the reliability of observed
correlations may be put into simpler form for general use, as shown in
Figure 17.2. This figure is based upon the idea that, although we cannot
be sure of the true correlation existing in the universe on the basis of the
correlation shown in a given sample, we can estimate a minimum value
for the true correlation, with a given chance of being wrong. Figure 17.2
has been calculated, from Fisher's results, to show such probable minimum
correlations in the universe, with the probability that the statements
based on the figure will be wrong for 1 sample out of 20, on the average.*
The results have been plotted for different sizes of samples and observed
correlations. Thus, if a random sample of 20 gives an observed correlation
of 0.70, the figure shows at a glance that we can say that the true
correlation is greater than 0.44, with the expectation that such statements
will be wrong only in 1 sample out of 20, on the average. Similarly, for an
observed correlation of 0.55 with a sample of 35 cases, reading from the
line for observed correlation at 0.55, and interpolating between n = 30
* For the source of Figs. 17.2 through 17.5, see Appendix 3, note 4.



[Figure 17.2. Horizontal axis: Correlation observed in sample.]

Fig. 17.2. Under conditions of random sampling, 1 sample out of 20, on the
average, will show a correlation coefficient with a ± value as high as that
"observed in sample" when drawn from a universe with the stated true correlation.

and n = 40 gives 0.32, which means that we can say that the true
correlation is greater than 0.32, with the same degree of confidence. The
figure can be used in a similar manner for any other size of sample up to
100, and any observed correlation.
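
Readings of this kind can be approximated numerically. The Python sketch below uses Fisher's z-transformation with a one-sided 5 per cent normal deviate of 1.645; this is our own approximation, offered only as a check, and is not necessarily the calculation by which the figure was drawn. The function name is hypothetical. For the two cases cited in the text it gives values that round to the figure readings of 0.44 and 0.32.

```python
import math

Z_ONE_SIDED_5_PER_CENT = 1.645  # normal deviate exceeded in 1 case out of 20


def minimum_true_correlation(r_observed, n, z_crit=Z_ONE_SIDED_5_PER_CENT):
    """Approximate lower limit for the universe correlation.

    Transforms r to z = atanh(r), whose standard error is taken as
    1 / sqrt(n - 3), moves down z_crit standard errors, and transforms back.
    """
    if n <= 3:
        raise ValueError("the approximation requires more than 3 observations")
    if not -1.0 < r_observed < 1.0:
        raise ValueError("r_observed must lie strictly between -1 and 1")
    z = math.atanh(r_observed)
    return math.tanh(z - z_crit / math.sqrt(n - 3))


if __name__ == "__main__":
    for r, n in ((0.70, 20), (0.55, 35)):
        print(f"observed r = {r:.2f}, n = {n}: "
              f"minimum true correlation about {minimum_true_correlation(r, n):.2f}")
```

The approximation is least satisfactory for very small samples, where readings from the figure itself should be preferred.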

Figure 17.2 deserves close study, for it tells a great deal about the
sampling reliability, or, rather, unreliability, of correlation coefficients.
The bottom line, for example, shows that, when samples are drawn from
a universe where the true correlation is zero, 1 sample out of 20 will show
a correlation as high as ±0.60, on the average, with samples of 10 cases;
as high as ±0.49, with samples of 15 cases; and as high as ±0.35, even
with samples of 30 cases. Similarly, if the samples are drawn from a
universe where the true correlation is 0.50, 1 sample out of 20, on the




average, will show a correlation as high as 0.81, with samples of 10;
as high as 0.73, with samples of 20; and as high as 0.69, with samples of
30. Many other similar comparisons can be made readily. For example,
if the true correlation is 0.80 and samples of 10 cases are used, 5 per cent
of the samples will show correlations as high as 0.93. These facts do not
take into account the tendency of many students to examine a number of
possible independent variables and to select for more detailed study
those which show the highest correlation with the dependent factor.
If that is done, the possible minimum correlation in the universe,
corresponding to the correlation observed in the sample so selected, will be
much lower than would be estimated from Figure 17.2.

Multiple Correlation Coefficients. The reliability of coefficients of
multiple correlation varies not only with the correlation and the size of
sample, but also with the number of independent variables. Fisher has
developed an exact method for judging the significance of observed
coefficients of multiple correlation.* Figures 17.3, 17.4, and 17.5 provide
a simple method of applying his conclusions for multiple correlation
coefficients, in the same way that Figure 17.2 provides for simple
correlation coefficients. For problems involving 3, 5, and 7 independent
factors, respectively, these figures show the approximate minimum true
correlation that probably exists in the universe with any size of sample up
to 100, and for any observed correlation, with the probability that the
statements based on the figure will be right for 19 samples out of 20, on the
average. Thus if, with 30 observations, a correlation of $R_{1.23456} = 0.80$
should be obtained, we can say that the true correlation (from Figure 17.4)
is at least 0.58. Similarly, if for 50 observations a correlation of
$R_{1.234} = 0.62$ were obtained, Figure 17.3 gives 0.42 as the probable minimum
correlation in the universe. These conclusions, of course, are subject to the
cautions given on pages 279 to 281 with respect to coefficients of simple
correlation. Problems with 2, 4, or 6 independent variables may be considered
by interpolating between the corresponding values given for 1, 3, 5, or 7
independent variables.

Considering the problem mentioned above, where a sample of 30
observations showed $R_{1.234} = 0.538$, Figure 17.3 gives a value of 0.16
as the probable minimum correlation. From the single sample we could
then say that the true correlation is at least 0.16 in the universe from which
the sample was drawn, with 1 chance in 20 of being wrong.

Figures 17.3, 17.4, and 17.5 show the possibilities of getting high
correlations from a random sample, even when there is little or no
correlation in the universe from which that sample was drawn. Thus, for 3

* R. A. Fisher, The general sampling distribution of the multiple correlation
coefficient, Proceedings of the Royal Society, A, Vol. 121, pp. 655-673, 1928.



Sampling Significance of Results 


[Figure 17.3. Chart: minimum correlation in universe for varying observed
correlations and size of sample, plotted against correlation observed in
sample.]

Fig. 17.3. Under conditions of random sampling, 1 sample out of 20, on the
average, will show a multiple correlation as high as that "observed in
sample," when drawn from a universe with the stated true multiple
correlation, in the case of multiple correlation with 3 independent
variables.

independent variables, Figure 17.3 shows that, if samples of 15
observations are used, in 1 sample out of 20, $R_{1.234}$ will be as large
as 0.69, even if the correlation in the universe is zero, and as large as
0.78, even if the true correlation in the universe is only 0.40. Similarly,
if there are 7 independent variables, Figure 17.5 shows that, if samples of
20 cases are used, in 1 sample out of 20, on the average, $R_{1.2345678}$
will be as large as 0.79 with zero correlation in the universe, 0.85 with
0.50 in the universe, and 0.91 with 0.70 in the universe. Even with samples
as large as 100 cases, $R_{1.2345678}$ in 5 per cent of the samples will be
as high as 0.37 for samples drawn from a universe with zero correlation, and
as high as 0.57 for samples drawn from a universe with 0.40 as the true
correlation.
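
The zero-correlation figures quoted above can be checked approximately by simulation. The sketch below is not the charts' own construction; it assumes the correlation model (multivariate normal universe), under which the squared multiple correlation from n observations and k independent variables follows a Beta(k/2, (n-k-1)/2) distribution when the universe correlation is zero. The function name, the number of draws, and the seed are illustrative choices.

```python
import math
import random

def r_exceeded_once_in_20(n, k, draws=50_000, seed=1):
    """Approximate 95th percentile of the sample multiple R when the
    universe correlation is zero.

    Assumption: under the correlation model with zero true correlation,
    R^2 follows a Beta(k/2, (n-k-1)/2) distribution, where n is the number
    of observations and k the number of independent variables.
    """
    rng = random.Random(seed)
    a, b = k / 2, (n - k - 1) / 2
    r_squared = sorted(rng.betavariate(a, b) for _ in range(draws))
    # The value exceeded by about 1 simulated sample in 20.
    return math.sqrt(r_squared[int(0.95 * draws)])

# The three zero-correlation cases mentioned in the text.
for n, k in [(15, 3), (20, 7), (100, 7)]:
    print(n, k, round(r_exceeded_once_in_20(n, k), 2))
```

The simulated values should fall close to the 0.69, 0.79, and 0.37 read from Figures 17.3 and 17.5; small differences are to be expected, since the chart values are themselves approximate readings. The cases with a non-zero universe correlation need Fisher's general distribution and are not covered by this sketch.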




For Measures of Regression and Correlation 





Fig. 17.4. Under conditions of random sampling, 1 sample out of 20, on the
average, will show a multiple correlation as high as that "observed in
sample," when drawn from a universe with the stated true multiple
correlation, in the case of multiple correlation with 5 independent
variables.


Figure 17.4 gives similar probabilities for 5 independent variables. Many 
other combinations of size of sample, true correlation in the universe, 
and observed correlation for 5 per cent of the samples are given in these 
figures. 

If the several independent variables in the multiple correlation study
had been selected by considering a large number of possible independent
variables, and by retaining only those which showed the highest gross or
net correlation with $X_1$, there is a much larger possibility of the
correlation in the sample exceeding the true correlation in the universe by
a wide margin. In fact, it is almost certain to be erroneously high. If error




Fig. 17.5. Under conditions of random sampling, 1 sample out of 20, on the
average, will show a multiple correlation as high as that "observed in
sample," when drawn from a universe with the stated true multiple
correlation, in the case of multiple correlation with 7 independent
variables.


calculations are to be used to judge the sampling significance of the
correlations observed, the variables must be selected partly on logical or
deductive grounds (as discussed at length in Chapter 26) rather than on any
such basis of empirical selection of those which show the apparent closest
relation. It should always be remembered that, if the choice is purely
empirical, the next following time period (in the case of economic,
meteorological, or other time series) might readily reverse the order of
apparent importance of the several variables.
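
The effect of purely empirical selection can be illustrated with a small simulation. This is a simplified sketch, not the authors' experiment: it uses simple rather than multiple correlation, every series is independent random noise (so the true correlation is zero throughout), and the sample size of 20, the 25 candidate series, and all function names are hypothetical choices.

```python
import math
import random

def pearson(x, y):
    """Simple correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def largest_abs_r(n_obs, n_candidates, rng):
    """Largest |r| between y and any candidate series; all are pure noise."""
    y = [rng.gauss(0, 1) for _ in range(n_obs)]
    return max(
        abs(pearson([rng.gauss(0, 1) for _ in range(n_obs)], y))
        for _ in range(n_candidates)
    )

def average_largest_r(n_obs, n_candidates, trials=200, seed=0):
    """Average, over repeated samples, of the best correlation found."""
    rng = random.Random(seed)
    total = sum(largest_abs_r(n_obs, n_candidates, rng) for _ in range(trials))
    return total / trials

# One variable chosen in advance, versus the best of 25 screened candidates.
print(round(average_largest_r(20, 1), 2), round(average_largest_r(20, 25), 2))
```

The second figure comes out much larger than the first even though no series is related to y in the universe, which is the sense in which a correlation obtained after screening many variables is almost certain to be erroneously high.
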

Examples of the Sampling Fluctuation in Multiple Correlation
Coefficients. In the sampling experiment referred to earlier, some
results of which were given in Table 17.2, the coefficients of multiple
correlation observed were distributed as shown in Table 17.5.


Table 17.5

Distribution of Values for Multiple Correlation Coefficients for
Repeated Samples Drawn from the Same Universe (True Value 0.565)

Range of Values     30 Observations     50 Observations     100 Observations

0.300-0.399 

4 

J 


0.400-0.499 

3 

5 


0.500-0.599 

4 

1 

« 

0.600-0.699

4 

3 

1 

0.700-0.799 

1 



This table illustrates how widely sample values of R may vary in small
samples. In the first sample of 30 observations, for example,
$R_{1.234} = 0.538$. But from Table 17.5, the sample selected might have
happened to be one of the samples that gave values as low as near 0.300, or
as high as over 0.700. From Figure 17.3, we estimate that such a correlation
might be observed, in 1 sample out of 20, from a universe with a true
correlation as low as 0.15. Table 17.5 illustrates the need for this caution
in interpreting such coefficients from small samples.

Indexes of Correlation. No precise mathematical estimates are
possible for the standard errors of simple and multiple indexes of
correlation based on graphic curves. Where they are based on mathematically
fitted equations, their reliability may be estimated by the same
interpretation from Fisher as is given in Figures 17.2 to 17.5, using
(m - 1) in place of the number of independent variables. (For complex curves
with many constants, this will involve going back to his original article to
calculate the probable minimum value in the universe, for P = 0.95.)
For graphic curves, the same interpretation may be made, on the assumption
that, except for the degrees of freedom involved in fitting the curves,
the correlation index is simply the correlation coefficient between $X_1$
and the $X_1'$ values calculated from the curve or curves, as at least a
first approximation to its reliability. In view of all the qualifications as
to the meaning of correlation coefficients or indexes, it is not usually
considered advisable to make much effort to interpret their sampling
significance.

All of these calculations of the sampling significance of r and R apply 
rigorously only to samples drawn from a natural universe of multivariate 
normal distribution (the correlation model). 




Adjustments for Numbers of Observations and Constants 


During the 1920's and 1930's, and occasionally more recently,
research workers were sometimes overimpressed with the importance
of results indicated by the values of r and R they obtained on the basis
of studies with only small numbers of observations. As noted earlier,
a straight line fitted to two observations will pass precisely through both.
In this case r = 1 and $S_{y.x} = 0$, even though a larger number of
observations from the same universe might show little or no correlation.
With small numbers of observations the unadjusted values of $r^2$ or $R^2$
tend to overstate the proportion of variation in Y that is associated with X
in the universe from which the sample was drawn. This is true even if the
values of X are fixed by an experimenter; in this case the universe is one
having the full array of values of Y for each of these fixed values of X.
Only a single value of Y for each value of X appears in each sample.

Since the squares of the unadjusted values of the standard error of
estimate and the standard deviation from a small sample also tend to be
biased and to underestimate the true variance in the universe from which
the sample was drawn, adjustments for these coefficients were introduced
earlier in the text, in equations (2.1) and (7.3). These adjustments are

$$\bar{s}_y^2 = s_y^2 \left(\frac{n}{n-1}\right), \qquad \bar{S}_{y.x}^2 = S_{y.x}^2 \left(\frac{n}{n-2}\right)$$

The first of these equations gives an unbiased estimate of the square of the
true parameter $\sigma_y$ in the universe; the second gives an unbiased
estimate of the parameter $\sigma_{y.x}^2$, the variation not associated
with, or accounted for, by X. It seems logical, therefore, under some
circumstances to calculate adjusted values for $r^2$ or $R^2$ that are
consistent with the two measures above.
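
Derivation of the Adjustments. The true correlation $\rho_{yx}$ in the
universe of the preceding paragraph may be defined by the following
relationship:

$$\rho_{yx}^2 = 1 - \frac{\sigma_{y.x}^2}{\sigma_y^2} \qquad (17.7)$$

Given the unadjusted value of $r_{yx}^2$ from a particular sample, we can
obtain an estimate of $\rho_{yx}^2$, which we will designate
$\bar{r}_{yx}^2$, by substituting in equation (17.7) the sample values of
$s_y$ and $S_{y.x}$, adjusted by equations (2.1) and (7.3) to eliminate the
bias previously noted. The equation then becomes

$$\bar{r}_{yx}^2 = 1 - \frac{\bar{S}_{y.x}^2}{\bar{s}_y^2} = 1 - (1 - r_{yx}^2)\left(\frac{n-1}{n-2}\right) \qquad (17.8)$$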
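
The intermediate algebra leading to equation (17.8) can be written out as follows. This is a sketch under two assumptions: that the adjustments of equations (2.1) and (7.3) multiply the unadjusted sample variances by n/(n-1) and n/(n-2) respectively (which is what an unbiased estimate requires when the unadjusted values use the divisor n), and that the unadjusted sample values satisfy $S_{y.x}^2 / s_y^2 = 1 - r_{yx}^2$.

$$\bar{r}_{yx}^2 = 1 - \frac{\bar{S}_{y.x}^2}{\bar{s}_y^2} = 1 - \frac{S_{y.x}^2 \, \dfrac{n}{n-2}}{s_y^2 \, \dfrac{n}{n-1}} = 1 - \frac{S_{y.x}^2}{s_y^2} \cdot \frac{n-1}{n-2} = 1 - (1 - r_{yx}^2)\left(\frac{n-1}{n-2}\right)$$

As a hypothetical numerical illustration, an unadjusted r of 0.50 from 10 observations gives $\bar{r}_{yx}^2 = 1 - (0.75)(9/8) = 0.156$, or an adjusted coefficient of about 0.40.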





Similarly, substituting the values of $\bar{s}$ and $\bar{S}$ in the
equations for $i_{yx}$, $R_{1.23 \ldots k}$, and $I_{1.23 \ldots k}$
[equations (7.8), (12.6), and (15.7)], the adjusted values are:

$$\bar{i}_{yx}^2 = 1 - (1 - i_{yx}^2)\left(\frac{n-1}{n-m}\right)$$

$$\bar{R}_{1.23 \ldots k}^2 = 1 - (1 - R_{1.23 \ldots k}^2)\left(\frac{n-1}{n-m}\right)$$

$$\bar{I}_{1.23 \ldots k}^2 = 1 - (1 - I_{1.23 \ldots k}^2)\left(\frac{n-1}{n-m}\right) \qquad (17.9)$$

Equation (17.9) thus provides a formula, where subscript k is the
number of variables and m is the total number of degrees of freedom
used up in fitting the regression equation, by which squares of the
coefficients of multiple correlation and indexes of simple or multiple
correlation can be adjusted to avoid consistently overestimating the
closeness of relationship in the universe.

In using equations (17.8) and (17.9), if the value for the adjusted
coefficient comes out a minus quantity, then 0 should be used for the
adjusted value.
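
A minimal sketch of this adjustment follows, assuming equation (17.9) has the form 1 - (1 - R^2)(n - 1)/(n - m), with m = 2 giving the simple-correlation case of equation (17.8). The function name and the sample inputs are illustrative only.

```python
def adjusted_coefficient(coef, n, m):
    """Adjust a correlation coefficient or index for sample size.

    coef: unadjusted coefficient or index (r, R, i, or I)
    n:    number of observations
    m:    number of constants (degrees of freedom) used in fitting;
          m = 2 for a simple straight-line regression
    Returns the adjusted coefficient, with 0 substituted when the adjusted
    square comes out negative.
    """
    adjusted_square = 1 - (1 - coef ** 2) * (n - 1) / (n - m)
    return max(adjusted_square, 0.0) ** 0.5

# An index of 0.77 from 30 observations, with 10 constants fitted.
print(round(adjusted_coefficient(0.77, 30, 10), 2))
# A weak correlation from a very small sample is adjusted down to 0.
print(adjusted_coefficient(0.30, 12, 4))
```

The adjustment is returned on the scale of the coefficient itself rather than its square, so that it can be compared directly with the unadjusted value; squaring the result recovers the quantity defined in the equation.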

These adjusted values, like correlation coefficients generally, are of
limited usefulness except when the sample has been randomly drawn from
a normal universe, and has not been selected in such a way as to make the
distribution of any variable in the sample different from that in the
universe. (Chapter 18, following, gives arithmetic examples of how seriously
the correlation in the sample may be changed from that in the universe
if such artificial selection of the sample is permitted.)

The meaning of the adjusted values, $\bar{r}^2$, $\bar{R}^2$, $\bar{i}^2$,
or $\bar{I}^2$, may also be explained as providing an unbiased estimate of
the per cent of variance most probably associated with particular sets of
values of the independent variables in a universe where those latter values
remain fixed, but values of the dependent variable are subject to random
variation.

If it is desired to adjust the coefficient of partial correlation, equation 
17.9 may be used for that also. 

Comments on Use of the Adjustments. The smaller the number of
observations, the larger the number of independent variables considered,
and the more complex the curves employed, the greater will be the tendency
for the observed (unadjusted) standard error of estimate to understate
the true error of estimate in the universe, and for the observed correlation,
simple or multiple, to overstate the true correlation in the universe.




Where freehand methods have been used to determine net regression
curves, there is particular need to be careful in appraising the relative
importance of the relationship as measured by the unadjusted $I^2$. In
such cases, particularly if the sample is small and the number of
independent variables is large, $\bar{I}^2$ will give a more realistic
estimate of the relationship in the particular universe from which the
observations were drawn, or of the importance of the relation observed in
the sample.

In current practice, the adjusted coefficients $\bar{r}$ and $\bar{R}$ are
seldom used. When the values of X are set by an experimenter, this
adjustment is implicit in the analysis of variance table which is generally
used in appraising experimental results (see Chapter 23). The limited
usefulness of the correlation coefficient in controlled experiments has
already been noted. The reliability of the regression coefficient and the
regression line can be tested without reference to r in both experimental
and non-experimental situations (see Chapter 23 and earlier sections of this
chapter).

When a sample has been drawn at random from a bivariate or multivariate
normal distribution (or close approximation thereto) the investigator may be
interested in a direct test of the reliability of r or R as an estimate of
probable correlation in the universe. Figures 17.2 through 17.5, based upon
the unadjusted values of the coefficients, provide such a test, as explained
earlier.

Illustrations of the Adjustments as Shown by Sampling Experiments.
Standard Errors of Estimate. The tendency for the unadjusted
values of correlation coefficients or indexes to be biased upward in small
samples, and for standard errors of estimate to be biased downward when
they are based on small samples, was shown very clearly in an experimental
study, already referred to, made by the senior author. Two universes of
known correlations were employed, and a number of samples were drawn for
each of the three sample sizes 30, 50, and 100. The curvilinear net
regressions were then determined for each sample separately (by the
successive approximation method) and the standard deviations were worked out
for the residuals in each case. The central values of these standard
deviations of the residuals, for the samples of each size, were


                              Observed Standard Deviation of z*
Number of Observations        Universe 1        Universe 2

30                               1.95              1.53
50                               2.18              1.64
100                              2.21              1.72
Entire universe                  2.40              1.80

* These values are the median values observed.



It is quite evident from these results that the unadjusted values
tended to give standard errors of estimate smaller than that for the universe
as a whole. The smaller the sample employed, the greater the overestimate
of the reliability of the estimates.

The shape of the universe net regression curves indicated that about
10 constants would be needed to approximate the entire regression equation
with a mathematical function. Using 10 for m in the equation

$$\bar{S}^2 = s_z^2 \left(\frac{n}{n-m}\right)$$

and carrying the same adjustment through for all the observed values
shown, adjusted standard errors of estimate were obtained as follows:


Number of          Value Used        Universe 1              Universe 2
Observations, n    for m         Observed   Adjusted     Observed   Adjusted

30                  10             1.95       2.39         1.53       1.87
50                  10             2.18       2.43         1.64       1.83
100                 10             2.21       2.33         1.72       1.82
Entire universe                    2.40                    1.80


In each case the adjusted values came much nearer to agreeing with the
true value for the universe than did the unadjusted values.
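
The adjusted column can be reproduced with a few lines of arithmetic. The sketch below assumes the adjustment multiplies the observed standard deviation of the residuals by the square root of n/(n - m); the function name is illustrative.

```python
import math

def adjusted_standard_error(s_observed, n, m):
    """Standard error of estimate adjusted for n observations and m constants.

    Assumes the adjusted variance equals the observed residual variance
    multiplied by n / (n - m).
    """
    return s_observed * math.sqrt(n / (n - m))

# Observed median values for Universe 1 and Universe 2, with m = 10.
for n, s1, s2 in [(30, 1.95, 1.53), (50, 2.18, 1.64), (100, 2.21, 1.72)]:
    print(n, round(adjusted_standard_error(s1, n, 10), 2),
          round(adjusted_standard_error(s2, n, 10), 2))
```

The results agree with the tabulated adjusted values to within about 0.01, the small differences presumably reflecting rounding in the published figures.
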

Multiple Correlation Indexes. The modal values of the indexes
of multiple correlation for the repeated samples from each universe were
found to be as follows, in comparison with the true index of correlation
for the entire universe:

Number of Observations     Observed Index of Multiple Correlation
in Sample                  in Samples Drawn from Same Universe

30                                       0.77
50                                       0.71
100                                      0.68
Entire universe                          0.62

In every case the modal observed correlation exceeded the true correlation
in the universe, and the smaller the size of the sample, the larger the
excess. Adjusting these values by equation (17.9), results are obtained
as follows:


Number of Observations,   Value Used   Observed   Adjusted
n                         for m        $I$        $\bar{I}$

30                        10           0.77       0.64
50                        10           0.71       0.63
100                       10           0.68       0.65
Entire universe                        0.62


Here again the adjusted modal values are in much closer agreement with
the universe value than were the unadjusted values.

Testing for Curvilinearity. The adjustments for degrees of freedom
provide one indication as to whether a curvilinear regression line or lines
really fit the observed data better than does a straight line or lines, or
whether a complex curve gives a better fit than a simple curve. Unless
the adjusted standard error of estimate is lower for the curve (or curves)
than for the straight lines, any reduction in the unadjusted S from the
linear equation may be regarded as a fictitious improvement in accuracy.
As we take additional variables into account, or use up more degrees of
freedom by employing more constants in the curves, we obtain a certain
amount of spurious increase in the apparent correlation, and a spurious
decrease in the unadjusted standard error of estimate. Correcting for n
and m removes this spurious effect, and thus is of direct value, even if no
inferences are drawn as to universe values. More precise tests of the
significance of the improvement in correlation obtained by fitting
curvilinear regressions are given in Chapter 23, pages 403 to 411.
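
The comparison described here can be sketched as a simple rule. The numbers are hypothetical (20 observations, a straight line using 2 constants, a curve using 5), the function names are illustrative, and the adjustment is assumed to take the n/(n - m) form used earlier in the chapter.

```python
import math

def adjusted_standard_error(s_unadjusted, n, m):
    """Unadjusted standard error scaled by sqrt(n / (n - m))."""
    return s_unadjusted * math.sqrt(n / (n - m))

def curve_fits_better(s_line, m_line, s_curve, m_curve, n):
    """True only if the curve still has the lower standard error of
    estimate after both fits are adjusted for the constants they use."""
    return (adjusted_standard_error(s_curve, n, m_curve)
            < adjusted_standard_error(s_line, n, m_line))

# The curve lowers the unadjusted S from 4.10 to 3.80, but the gain
# disappears once the three extra constants are allowed for.
print(curve_fits_better(4.10, 2, 3.80, 5, n=20))
```

This is only the rough indication the text describes; the formal tests referred to in Chapter 23 remain the more precise check.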

Summary 

Coefficients of simple and net regression, when determined from a
sample, may vary more or less widely from the true value for the "natural"
universe from which the sample was drawn (or for the artificially
constructed universe which we described as the regression model). In either
case, confidence intervals can be calculated which serve to indicate how
much reliance can be placed on each regression coefficient observed.

Coefficients of simple or multiple correlation have clear-cut sampling
significance only in the case of samples based on the correlation model.
For such samples, means are available for estimating the probable minimum
correlation in the universe for P = 0.95, from the fact that 1 sample
out of 20 might show a correlation as high as that obtained in the sample,
For the sampling characteristics of ft see Technical Note 8 m Appendix 3 




even if the universe had only the specified correlation. When universes
and samples are artificially constructed or selected (the regression model),
correlation coefficients or indexes have no significance as a basis for
inferences as to the underlying "natural" universe if any, but apply only
to the very restricted universe of the regression model.

There is a tendency to upward bias in calculating coefficients or indexes
of correlation from small samples or in problems which involve many
variables or complex curves. Approximate adjustments are presented to
correct for this bias. These are especially important, with small samples
or many constants, in comparing the significance of alternative solutions
which use up varying degrees of freedom.

The standard error of estimate is also subject to downward bias
unless adjusted for degrees of freedom, but adjustments for that have
already been presented in earlier chapters.


References

Fisher, R. A., The general sampling distribution of the multiple correlation
coefficient, Proc. Roy. Soc., A, Vol. 121, pp. 655-673, 1928.

Ezekiel, M., The application of the theory of error to multiple and
curvilinear correlations, Proc. Amer. Stat. Assoc., Vol. XXIV, March, 1929.

Ezekiel, M., A first approximation to the sampling reliability of multiple
correlation curves obtained by successive graphic approximations, Annals of
Math. Stat., Vol. 1, September, 1930.

Schultz, Henry, The standard error of a forecast from a curve, Jour. Amer.
Stat. Assoc., pp. 139-185, Vol. XXV, No. 170, June, 1930.

Kendall, M. G., The Advanced Theory of Statistics, Vol. I, Chapters 14 and
15, Vol. II, Chapters 22 and 28, Charles Griffin, London, 1943.



CHAPTER 18


Influence of selection of sample 
and accuracy of observations on 
correlation and regression results 


Selection of Sample 

Methods of determining linear and curvilinear regressions, together
with appropriate measures of their significance and accuracy, have been
set forth in previous chapters. These methods do not yield results
representative of the universe from which the sample observations have
been drawn, however, if that sample is not truly representative of the
particular relation being determined in the universe from which the
sample is drawn. There are various ways in which the sample may fail
to represent the universe, and the resulting extent to which the correlation
constants will be biased will vary both with the character of the
unrepresentativeness and with the individual coefficients. Each type of
abnormality must therefore be treated separately.

The samples may be selected from the universe in such a way as to
exclude all the observations falling beyond a certain value of a given
variable, thus ruling out values either at one or at both extremes, or
perhaps ruling out middle values and selecting only extreme ones. This
may be done for either the dependent variable or the independent variable
or variables, or for both together. Such a selection of observations
produces certain specific effects upon the correlation constants. Under
some conditions it may be very desirable to select the observations in this
way, if the resulting aberrations in the correlation constants are recognized
and allowed for.

A second and somewhat more difficult type of problem to deal with
arises when there are errors of measurement in obtaining the values of
one or more of the variables, such errors as might arise, for example,
in estimating the total production of corn in the United States within a
given year, or in working out, from a farmer's memory, what was the
income on his farm the previous year. Here again the effect on the
correlation constants will depend upon whether the errors are random
or biased and upon which variable or variables are affected by the errors.
A separate discussion must therefore be given each case.

The clearest way of indicating the effect of these various departures from
truly representative sampling may be by first stating the general principles
involved, and then illustrating the way those principles work out by
concrete illustrations. Except where specially stated otherwise, the
discussion will apply solely to linear relations. The effects for curvilinear
relations are in general analogous to those to be discussed. The illustrations
here are purely in arithmetic terms. More general illustrations in algebraic
terms would yield parallel conclusions.

Selection of Sample with Respect to Values of Independent Variable.
If the sample is selected with respect to values of the independent variable,
that will not tend to affect the slope of the regression line but will affect
the value of the coefficient of correlation. If the selection is such that
extreme values are rejected but intermediate ones are left in, the correlation
will be lowered below that prevailing in the universe; if intermediate
values are rejected and only extreme ones are used, the correlation will
be raised above that prevailing in the universe. If the standard deviations
of arrays are equal, the standard error of estimate will tend to remain the
same, regardless of the selection.

These principles may be illustrated by the set of hypothetical data 
shown in Table 18.1. 


Table 18.1

Correlation Table, Showing Hypothetical Frequencies
at Specified Values

                          Values of X
Values of Y        0      1      2      3      4

0                 ...    ...     1      1      1
1                 ...     1      2      2      1
2                  1      2      4      2      1
3                  1      2      2      1     ...
4                  1      1      1     ...    ...



For the data shown, $\bar{r}_{xy} = -0.47$, $s_x = s_y = 1.134$, and
$b_{yx} = -0.50$.

If now the values had been selected with reference to X, so as to exclude
values below 1 or above 3, the number of observations would have been
reduced from 28 to 22. For this restricted set of observations,
$\bar{r}_{xy} = -0.26$, $s_x = 0.739$, $s_y = 1.09$, but $b_{yx} = -0.50$.
Computing the standard error of estimate, we arrive at
$\bar{S}_{y.x} = 1.02$ for the first case and 1.07 for the second. It is
quite apparent that the correlation has been lowered by the restriction in
the selection of the values of X, but the regression of Y on X has not been
changed at all, and the standard deviation of the residuals has been only
slightly changed.

If now the selection is such that only extreme values of X are taken, say
below 1 and above 3, the number of observations is reduced to 6. Computing
the results for those values, we have $s_x = 2.00$, $s_y = 1.29$,
$\bar{r}_{xy} = -0.71$, but $b_{yx} = -0.50$. Also $\bar{S}_{y.x} = 1.00$.

Bringing the three sets of results together for comparison, we have the
following tabulation.

With X used as independent variable:

                                 $s_x$   $s_y$   $\bar{r}_{xy}$   $\bar{S}_{y.x}$   $b_{yx}$

All cases                        1.13    1.13       -0.47             1.02           -0.50
Extreme values of X excluded     0.74    1.09       -0.26             1.07           -0.50
Only extreme values of X used    2.00    1.29       -0.71             1.00           -0.50


In the illustration, the observations are distributed very symmetrically.
Were they distributed more irregularly the values of b and $\bar{S}$ might
vary more widely between the three sets, but still would tend to be
similar.

These three examples thus illustrate the principles stated before, that
selection with respect to the independent factor does not tend to change the
regression or the standard deviation of the residuals but does affect the
correlation, lowering the correlation if it has lowered the dispersion of the
independent factor and raising the correlation if it has increased the
dispersion of that factor.

Selection of Sample with Respect to Values of Dependent Factor.
Selection with respect to values of the dependent factor is more serious,
in that it affects all the constants. According to whether the effect is to
raise or to lower the standard deviation of the dependent factor, such
selection tends to raise or lower both the regression coefficient and the
coefficient of correlation from the value for the universe and likewise
to raise or lower, respectively, the standard error of estimate.

These principles may be illustrated from the three examples just used,
by regarding X as the dependent factor and Y as the independent factor
and noting the influence of the selection with regard to the dependent
factor X upon the regression of X on Y, $b_{xy}$. For the first case, with
all values left in, $b_{xy} = -0.50$ and $\bar{S}_{x.y} = 1.02$. For the
second case, however, with extreme values of X left out, $b_{xy}$ drops to
-0.23 and $\bar{S}_{x.y}$ becomes 0.73. For the third case, with only
extreme values of X included, $b_{xy}$ increases to -1.20 and
$\bar{S}_{x.y}$ becomes 1.55. Bringing these three sets together yields the
following comparison.

With X used as dependent variable:

                                    $s_x$   $s_y$   $\bar{r}_{xy}$   $\bar{S}_{x.y}$   $b_{xy}$

All cases                           1.13    1.13       -0.47             1.02           -0.50
Extreme values of X excluded        0.74    1.09       -0.26             0.73           -0.23
Only extreme values of X included   2.00    1.29       -0.71             1.55           -1.20


These results indicate the extent to which selection with regard to the
dependent factor may completely destroy the usefulness of all the results.
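
Both tabulations can be reproduced from the frequencies of Table 18.1. In the sketch below the frequency grid is transcribed from that table, and the function and key names are illustrative. One inference is built in and should be treated as an assumption: the published correlations and standard errors of estimate are matched when the adjusted forms are used (the adjusted correlation of equation (17.8), and residual variance divided by n - 2), while the standard deviations use the divisor n.

```python
import math

# FREQ[y][x] = number of cases with that pair of values (Table 18.1).
FREQ = [
    [0, 0, 1, 1, 1],  # Y = 0
    [0, 1, 2, 2, 1],  # Y = 1
    [1, 2, 4, 2, 1],  # Y = 2
    [1, 2, 2, 1, 0],  # Y = 3
    [1, 1, 1, 0, 0],  # Y = 4
]

def summarize(keep_x):
    """Constants for the cases whose X value satisfies keep_x(x)."""
    cases = [(x, y, FREQ[y][x]) for y in range(5) for x in range(5)
             if FREQ[y][x] and keep_x(x)]
    n = sum(f for _, _, f in cases)
    mx = sum(x * f for x, _, f in cases) / n
    my = sum(y * f for _, y, f in cases) / n
    sxx = sum(f * (x - mx) ** 2 for x, _, f in cases)
    syy = sum(f * (y - my) ** 2 for _, y, f in cases)
    sxy = sum(f * (x - mx) * (y - my) for x, y, f in cases)
    r2 = sxy ** 2 / (sxx * syy)
    adj_r2 = max(1 - (1 - r2) * (n - 1) / (n - 2), 0.0)
    return {
        "n": n,
        "s_x": math.sqrt(sxx / n),
        "s_y": math.sqrt(syy / n),
        "r_adj": math.copysign(math.sqrt(adj_r2), sxy),
        "b_yx": sxy / sxx,                                   # Y on X
        "S_yx_adj": math.sqrt(syy * (1 - r2) / (n - 2)),
        "b_xy": sxy / syy,                                   # X on Y
        "S_xy_adj": math.sqrt(sxx * (1 - r2) / (n - 2)),
    }

selections = {
    "all cases": lambda x: True,
    "extreme X excluded": lambda x: 1 <= x <= 3,
    "only extreme X": lambda x: x in (0, 4),
}
for label, rule in selections.items():
    row = summarize(rule)
    print(label, {k: round(v, 2) for k, v in row.items()})
```

The slope of Y on X stays at -0.50 in all three selections, while the slope of X on Y moves from -0.50 to -0.23 and -1.20, as in the tabulations. The adjusted correlation for the middle selection computes to about -0.27 against the published -0.26, a difference small enough to be rounding.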

Selection of Samples with Reference to Values of Both Variables.
Selection of cases with reference to values of both independent and
dependent variables has an even greater effect upon the conclusions than
the two cases already discussed, because selection of extreme values tends
to exaggerate the correlation and regression, and of central values to
lower both to an even greater extent than where the selection is with
respect to the dependent factor alone.

If, in the data of Table 18.1, only those cases are selected in which values
of X below 2 are associated with values of Y above 2, and in which values
of X above 2 are associated with values of Y below 2, the observations
are reduced to 10 cases, as follows:

Values   Values   Number        Values   Values   Number
of X     of Y     of Cases      of X     of Y     of Cases

0        3        1             3        0        1
0        4        1             3        1        2
1        3        2             4        0        1
1        4        1             4        1        1

For these values, $s_x = 1.48$, $s_y = 1.48$, $\bar{r}_{xy} = -0.90$,
$b_{yx} = -0.91$, and $\bar{S}_{y.x} = 0.68$.

It is evident that such selection raises both the correlation and the
regression above the true value for the universe. This is to be expected,
for this selection is equivalent to picking out the pairs of values which do
show correlation with each other. Restricting the selection to paired
values above 1 for both variables, and below 3 for both variables, likewise
would be picking out cases so as to eliminate all correlation. Such
selection obviously destroys the value of the results.



Conclusions with Reference to Selection of Data. If an investigator
is interested only in the regression line and not in the degree of correlation
and if the regression is truly linear, selection of data with reference to the
independent factor (or factors) will not tend to change the slope of the
regression line (or lines). Under those conditions selection of extreme
cases of the independent factor may yield a reliable indication of the
regression with many fewer observations than if the cases were selected
at random. This principle is frequently applied in experimental or
laboratory work, but is equally applicable in other types of investigations.
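
The gain from extreme values can be made concrete with the standard result that, for a straight-line regression with the same residual standard deviation in every array, the standard error of the slope is that residual standard deviation divided by the square root of the sum of squared deviations of X. The sketch below applies this to the X values of the two restricted selections of Table 18.1; the function name and the unit residual standard deviation are illustrative assumptions.

```python
import math

def slope_standard_error(x_values, sigma=1.0):
    """Standard error of a fitted slope for the given X values, assuming a
    truly linear regression and a common residual standard deviation sigma."""
    mean_x = sum(x_values) / len(x_values)
    return sigma / math.sqrt(sum((x - mean_x) ** 2 for x in x_values))

extreme = [0] * 3 + [4] * 3              # the 6 cases with X = 0 or 4
central = [1] * 6 + [2] * 10 + [3] * 6   # the 22 cases with X from 1 to 3

print(round(slope_standard_error(extreme), 3))
print(round(slope_standard_error(central), 3))
```

Under these assumptions the 6 extreme cases fix the slope more closely than the 22 central cases do, which is the sense in which extreme selection can economize on observations when the regression is known to be linear.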

If the regressions are curvilinear, however, special selection of either
extreme or central items of the independent variables forestalls the
determination of the nature of the function, since curvilinear regressions
can be determined only for the ranges of the independent factor within
which observations have been secured. For such regressions, therefore,
the nature of the function may be more accurately determined if the
independent items are selected so as to be spread fairly uniformly through
the whole range of values, thus affording a sufficient number of
observations for accurate determination of the nature of the relation
throughout the whole range. Selection purely at random frequently provides
more observations than are needed for certain portions of the curve, and
provides so thin a scattering of observations at other portions as to make
its true position and shape quite indeterminate, as has been illustrated
previously. Even if curvilinearity is only suspected, such a uniform
distribution of values for the independent variable provides an improved
basis for determining whether or not the regression is truly linear, as
compared with an equal number of observations selected at random. At the
same time, if the dependent factor is normally distributed, with its standard
deviation similar for different arrays, selection with reference to the
independent factor does not tend to change the standard error of estimate.
(The regression model, as discussed in Chapter 17.)

If the primary interest, however, is not in the nature of the relations
and in determining how closely values of the dependent factor may be
estimated (regressions and standard error), but instead is in determining
what proportion of the original variation in the dependent factor can be
accounted for on the basis of the relations determined (correlation and
determination), then anything other than random selection with reference
to any factor will give estimates of the closeness of the correlation which
either over- or underestimate the true correlation in the universe from
which the sample is drawn. For most accurate results in such problems,
the distribution of the dependent factor in the sample should be an accurate
representation of the distribution in the universe from which the
observations were drawn, and the only selection which would be justified
would be aimed at securing such a sample. (See the correlation model,
as discussed in Chapter 17.)

Since the correlation coefficient or index, and the parallel measures of
determination, are of significance only with respect to the standard
deviation of the observed values of the dependent factor, it follows that when
the dependent factor has such an abnormal distribution that its standard
deviation is of little value as a descriptive statistic, the measures of
correlation also tend to be of little value. For any series which actually
yielded such an extreme distribution as the dichotomous values used in the
third case of those just illustrated, measures of correlation would have
little significance except their formal mathematical definition. Yet the
regressions and standard errors of estimate would tend to retain all their
usual value and significance, so long as no selection had been made with
reference to values of the dependent variable. In such a case, attempting
to select the values of the dependent factor so as to make the series more
nearly normal might seriously bias the regression results.

Accuracy of Observations 

The data with which the statistician has to deal are frequently subject
to errors of observation. If corn yields are being studied in relation to
fertilizer applications, for example, farmers may be able to estimate the
yield per acre on a given tract only to within 5 or 10 bushels of the true
yield. If livestock prices are being studied, the market reporter may not
be able to get his daily average nearer than within 10 or 25 cents per
100 pounds of the true average of all the sales for the day. Or if
educational ratings are being studied, the instructor may not be able to
grade the test papers nearer than to within 5 or 10 per cent of the grade
each really deserves. All these illustrations are akin to the difficulties
of the surveyor, who finds he cannot measure his angles more accurately
than within a certain number of seconds; or of the astronomer, who finds
his repeated observations disagree from each other by fractions of a
second. But the errors of measurement are ordinarily tremendously greater
in biological, economic, or social investigations than in physical
observations; and for that reason statisticians must be particularly
careful to use their data in such a way as to minimize the influence upon
their conclusions of the errors which may be present.

Errors of observation may be such that they are not correlated with
the value being observed, hence tend to fall equally above and below the
true values throughout the range of the variable; or else they may be
such that they are correlated with the variable, tending usually to make
the observed value fall above the true value in the upper part of the range
and below the true value in the lower part, or vice versa.

In correlation problems, there are two sets of true values involved,
those for the dependent and independent variables, and there may also
be two sets of errors, one tending to cause the observed values for the
dependent variable to differ from the true values, and the other affecting
the independent variable. The extent to which such errors, if present,
modify or impair the results of correlation analysis depends both upon
the type of the errors and the variables which they affect.

If the errors affect only values of the dependent factor, and if they
are not correlated with the true values, their presence tends to lower
the correlation and to increase the standard error of estimate, but does
not tend to change the slope of the regression line from the true slope
for the universe. If, however, uncorrelated errors are in the independent
factor, that not only tends to lower the correlation and increase the
standard error of estimate, but also tends to decrease the regression
below the true value. Both of these cases may be illustrated from the
same set of data used before.

Errors in the Dependent Variable. The data used in Table 18.1 may
be modified by assuming some random error influences Y, making
one-third of the values 1 unit higher, one-third 1 unit lower, and leaving
one-third unchanged. With these changes, the data appear as in Table
18.2.

Table 18.2 


Correlation Table, Showing Hypothetical Frequencies at Specified
Values, with Random Errors in Y


Values of Y 



Values of X 



0 

1 

2 

3 

4 

-1 

0 


1 

I 

2 

I 

1 

1 


3 


1 

2 


2 

2 

3 


3 

1 

2 

3 

1 

1 

4 



1 



5 

1 

1 





For these data, $r_{xy} = -0.33$, $b_{yx} = -0.50$, and $S_{y.x} = 1.46$. The
introduction of the random error into Y has lowered the correlation
from that of -0.47 for the original values and increased the standard
error of estimate; but it has had no significant effect upon the regression
of Y on X, the new value -0.50 being identical with the value of -0.50
for the original data in Table 18.1.

Errors in the Independent Variable. If, however, X is regarded as
the dependent factor and Y as the independent, the regression coefficient
for the new values, $b_{xy} = -0.28$, is found to be much reduced from that
of -0.50 for the original values. Introducing even random errors into
the observations of the independent factor markedly reduces the observed
regression below the true value.
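
The contrast between the two cases can be checked on a small constructed example. The sketch below is an illustration only and does not use the data of Table 18.1: the true relation Y = 4 - 0.5X, the three observations at each level of X, and the helper name `slope_and_r` are all assumptions made for the purpose. The errors of +1, 0, and -1 are arranged so that they are exactly uncorrelated with the true values.

```python
# Illustrative sketch (hypothetical data, not Table 18.1): a balanced design
# in which the errors are exactly uncorrelated with the true values.

def slope_and_r(xs, ys):
    """Least-squares slope of ys on xs, and the correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx, sxy / (sxx * syy) ** 0.5

# Assumed true relation: Y = 4 - 0.5 X, each X level observed three times.
true_x = [x for x in range(5) for _ in range(3)]
true_y = [4 - 0.5 * x for x in true_x]
# Errors of +1, 0, -1 at every X level, so they cancel at each level.
errors = [1, 0, -1] * 5

# Case 1: errors in the dependent variable only.
b_dep, r_dep = slope_and_r(true_x, [y + e for y, e in zip(true_y, errors)])
# Case 2: the same errors in the independent variable only.
b_ind, r_ind = slope_and_r([x + e for x, e in zip(true_x, errors)], true_y)

print(f"errors in Y: slope = {b_dep:.3f}, r = {r_dep:.3f}")
print(f"errors in X: slope = {b_ind:.3f}, r = {r_ind:.3f}")
```

In this constructed case the slope stays at -0.5 when the errors are in Y, while errors in X shrink it to -0.375, that is, by the ratio of the true sum of squares of X (30) to the observed sum of squares (40). The correlation is lowered in both cases.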

The errors considered to this point have all been random errors. If,
instead, the errors are correlated with either of the factors, their presence
would obscure the true relationship and bias any correlation constants
which might be computed, tending to make them either too high or
too low, depending on the inter-relations between the errors and the
variables.

Errors in Both Variables. If random errors are associated with both
variables simultaneously, their effects are a blending of those just
illustrated, tending to reduce both the closeness of correlation and the
regression below the true values. For example, if random errors of the
same magnitude are introduced into X as well as Y of Table 18.1, the
values appear as in Table 18.3.


Table 18.3


Correlation Table, Showing Hypothetical Frequencies at Specified
Values, with Random Errors in Both X and Y


Values 



V'alucs of X 



of y 

-1 

0 

1 2 3 

< 

“t 

5 


-1 

... 


. . . 




0 

... 

1 

1 

1 



1 

I 

1 

1 

1 

I 


2 

1 

2 

2 

1 

1 


3 

I 

2 

2 

1 

1 

1 

4 

... 



1 



5 

{ 

i 






With these changes, the correlation is reduced to practically 0, the
standard error is increased to 1.524, and the regression of Y on X is
changed to -0.179. The comparison of these constants with those for
the original data in Table 18.1 illustrates the extent to which the presence
of random errors in the observed values of the variables may reduce the
accuracy and effectiveness of correlation analysis.

Dealing with Errors in Both Variables. The methods of computing
the regression line considered to this point are methods which take one
variable as given, or independent, and the other variable as based upon
it, or dependent. If it is known that all the errors of observation are
random and are in one variable, and none are in the other, the effect of
those errors may best be eliminated by considering the one with no errors
as independent and the other as dependent. As has just been demonstrated,
the regression line then obtained will be practically identical with that
which would be obtained if no random errors at all were present.

In some cases it may be known that both variables are equally subject to
random error with the same variance, yet it may be desired to obtain a
regression line which most accurately expresses the relation between the
two. That can be done by a special method, which fits the line on the
condition that the sum of the squares of the departures of each observation
perpendicular to the fitted line shall be made a minimum (in contrast to
the usual condition that the sum of the squares of the vertical departures
from the fitted line shall be made a minimum, with the dependent variable
plotted as the ordinate). This special method involves an entirely different
procedure for fitting the line, and is not given here.* Chapter 24 presents
another and more modern method of dealing with a similar type of
problem.
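
Although the perpendicular-departure method is not developed in the text, its result for two variables can be sketched briefly. The code below uses the standard closed-form slope for orthogonal (perpendicular) least squares with equal error variances; it is offered as an illustration and is not taken from the references cited in the footnote. The function name `fit_lines` and the five data points are hypothetical.

```python
import math

def fit_lines(xs, ys):
    """Return three slopes (dY/dX): the usual vertical-departure line, the
    perpendicular-departure line, and the X-on-Y line re-expressed as dY/dX."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b_vertical = sxy / sxx          # minimizes squared vertical departures
    b_reverse = syy / sxy           # minimizes squared horizontal departures
    d = syy - sxx
    # Slope minimizing squared perpendicular departures (requires sxy != 0).
    b_perp = (d + math.sqrt(d * d + 4 * sxy * sxy)) / (2 * sxy)
    return b_vertical, b_perp, b_reverse

# Hypothetical observations with some scatter in both variables.
xs = [0, 1, 2, 3, 4]
ys = [0.1, 0.9, 2.2, 2.8, 4.0]
b_vertical, b_perp, b_reverse = fit_lines(xs, ys)
print(b_vertical, b_perp, b_reverse)

# For points lying exactly on a line, all three slopes coincide.
exact = fit_lines([0, 1, 2], [1, 3, 5])
```

When the two variables are positively related, the perpendicular slope falls between the other two, which is consistent with treating both variables as equally subject to error.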

Errors of Observation in Multiple Correlations. The points which
have been illustrated here for simple correlation are equally true for
multiple correlation, both with respect to the influence of selection of
sample and of the effect of errors of observation. The influence of errors
of observation in multiple correlation problems may be illustrated by a
case based on actual economic data.

Over the 17 years from 1907 through 1923, the monthly price of lambs
shows a very high correlation with the price of wool and the price of
dressed lamb. When $X_2$ is used for the price of wool, in cents per pound,
$X_3$ for prices of dressed lamb, in cents per pound, and $X_1$ for prices of
live lambs, in cents per pound, multiple correlation gives, for the 204
observations, $R_{1.23} = 0.991$ and $x_1 = 0.144x_2 + 0.354x_3$.

To test what effect random errors would have had on this correlation,
two dice were thrown 204 times, giving random values from 2 to 12.
These values were then added to the successive values of the dependent,
and a similar set of 204 values to the successive observations of one
independent factor, to see what effect that would have on the results.
In Table 18.4 the notation X + e is used to designate the variables to
whose values these "random errors" had been added.

* Abraham Wald, Fitting of straight lines if both variables are subject to error,
Annals of Mathematical Statistics, Vol. XI, No. 3, pp. 284-299, September, 1940. See
also Albert Madansky, The fitting of straight lines when both variables are subject to
error, Jour. Amer. Stat. Assoc., Vol. 54, No. 285, pp. 173-206, March, 1959.


Table 18.4

Effect of Introducing Random Errors on Correlation Results

Independent            Dependent      Multiple
Variables              Variable       Correlation    Regression Equation

$X_2$ and $X_3$        $X_1$          0.991          $0.144x_2 + 0.354x_3$
$X_2$ and $X_3$        $X_1 + e$      0.821          0.1??$x_2$ + 0.4??$x_3$
$X_2$ and $X_3 + e$    $X_1$          0.953          $0.163x_2 + 0.277x_3$
$X_2$ and $X_3 + e$    $X_1 + e$      0.804          $0.152x_2 + 0.305x_3$


These results illustrate the principles just set forth. The introduction
of random errors into the dependent variable ($X_1$) reduces the correlation,
and changes somewhat the size of the two regression coefficients. The
errors in this case may not have been completely randomly distributed
and uncorrelated with $X_1$, $X_2$, and $X_3$, even though determined by throws
of dice.

But the second modification, where the error is introduced into the
independent variable $X_3$ instead, is much more striking. The correlation
is not reduced so much as in the first case, and the regression of $X_1$ on $X_2$
is changed only slightly from the original value, and increased as it
happens. The net regression of $X_1$ on $X_3 + e$, however, is only three-fourths
as large as was the net regression of $X_1$ on $X_3$, in spite of the fact
that the error introduced was only enough to raise the standard deviation
of $X_3$ from 6.14 to 6.64.

The final case, with errors introduced into both $X_1$ and $X_3$, shows the
lowest correlation of any, as would be expected, even though the net
regressions are not greatly changed.

Just how great an effect random errors may have upon the results
depends upon the magnitude of the errors, the original variation in the
variables, and the closeness of the inter-correlation. Equations can be
derived to show how great a reduction in correlation errors of a given
magnitude will produce, but they are of little practical use in economic
work, since it is usually difficult enough to determine whether there are
errors of observation or not, and much more so to determine what
magnitude they have.* Reports or estimates of prices, or commodity
production, or supply are nearly always subject to more or less error.
The same is true of many other economic data. It may be slightly
reassuring to know that observational errors even as large as those
just considered still modify the regression results as little as these have
been seen to do.

If there is known to be a large but random error in observing some
variable, that variable may still be used as the dependent variable in a
correlation study without making the regressions or estimating equation
very far wrong, if determined with a large number of cases; but, on the
other hand, any use of that variable as an independent variable will be
certain to yield results which understate the actual relations.

Biased errors tend to make the results more or less in error, regardless
of the variables to which they apply. If the errors tend either to magnify
or to minimize the differences which actually exist, they will have a
parallel effect on the regression coefficients if they apply to the dependent
variables and an inverse effect if they apply to an independent variable.
There are so many different types of bias, however, that no more definite
statement of the effects can be laid down.
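
One simple type of bias can be worked through as an illustration. The sketch below assumes a bias that stretches every departure from the mean by a fixed factor of 1.25; the data, the factor, and the helper names `slope` and `magnify` are hypothetical and chosen only to show the parallel and inverse effects described above.

```python
def slope(xs, ys):
    """Least-squares regression coefficient of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def magnify(values, k):
    """Bias that stretches each value's departure from the mean by factor k."""
    m = sum(values) / len(values)
    return [m + k * (v - m) for v in values]

# Hypothetical observations.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]

b_true = slope(xs, ys)
b_dep_biased = slope(xs, magnify(ys, 1.25))   # bias in the dependent variable
b_ind_biased = slope(magnify(xs, 1.25), ys)   # bias in the independent variable

print(b_true, b_dep_biased, b_ind_biased)
```

Under this assumed form of bias, magnifying the differences in the dependent variable multiplies the regression coefficient by the same factor, while magnifying the differences in the independent variable divides it by that factor.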

Random errors have the same type of effect in curvilinear correlation
that they do in linear correlation, since if they are truly random they will
tend to be balanced out along all the portions of the regression curve
alike if in the dependent variable, or they will tend to confuse the relations
along the curve if in the independent variable, and so they reduce the
differences observed. About the only real difference between linearity
and curvilinearity with regard to errors is that random errors in the
dependent variable could be "balanced out" in the case of a straight-line
regression with a somewhat smaller number of observations than would be
necessary to secure valid results for a curvilinear regression.

Where, with random errors in the dependent factor, there are not
enough cases available to "balance them out," the effect of the errors is to
throw a varying amount of error into the conclusions, the exact amount of
the error depending on how closely the errors approach being canceled
out. The illustrative case, where with over 200 observations the regressions
were still changed somewhat, probably indicates what may be obtained
by a combination of slight departures from true "randomness" in the
errors with a sample not quite large enough to eliminate entirely all the
resulting instability. This may be nearer to what would usually happen in
practice than the theoretically complete elimination of the errors in the
dependent variable shown in the symmetrical arithmetic examples.

* In the problem given, the significant values determining the effect of the errors are

$$s_1 = 3.96, \quad s_{1+e} = 4.74; \qquad s_3 = 6.14, \quad s_{3+e} = 6.64$$

If the errors are in the dependent variable alone, the relations between the true and the
apparent correlations are indicated by the equation

$$R'^2 = \frac{R^2}{1 + (s_e^2 / s_1^2)}$$

This gives what the new correlation would be if the errors were truly random, so that
the new regression equation came out identical with the old. In the problem given, this
gives an expected value for R of 0.827 as compared to the 0.821 actually obtained.
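
The footnote's relation between the true and the apparent correlation can be checked numerically. The sketch below is a hedged illustration: it assumes that the variance of the errors is the difference between the squared standard deviations with and without the errors (which holds only if the errors are uncorrelated with the true values), and the function name `apparent_correlation` is hypothetical.

```python
import math

def apparent_correlation(true_r, s_true, s_observed):
    """Correlation expected after random errors are added to the dependent
    variable, given its standard deviation without and with the errors."""
    # Assumption: errors uncorrelated with true values, so variances add.
    error_variance = s_observed ** 2 - s_true ** 2
    return true_r / math.sqrt(1 + error_variance / s_true ** 2)

# Values quoted in the footnote for the lamb-price problem.
expected = apparent_correlation(0.991, 3.96, 4.74)
print(round(expected, 3))
```

With the rounded standard deviations quoted in the footnote, the result agrees with the expected value given there to within about 0.001; the small difference presumably reflects rounding of the inputs.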

Summary 

Modification of the observations from the true conditions, either by
selection of the sample or by the presence of errors of observation, tends
to alter the value of the coefficient of correlation. If the regression line or
curve is of primary interest, however, its accuracy of determination may be
increased by suitable selection of observations with respect to independent
factors. The standard error of estimate likewise is little affected by
selection with respect to the independent variables. Similarly, random
errors of observation may not influence the regressions, if the factor
they affect can be treated as the dependent factor and if enough
observations are available to balance out the errors. These points hold
true for multiple correlation problems as well as for two-variable problems.



CHAPTER 19 


Estimating the reliability 
of an individual forecast 


Chapter 17 has indicated the kind of variability from sample to sample
that may be expected in determining statistical constants, such as regression
and correlation coefficients, and in fitting regression lines and curves. It
has provided means of estimating, from the values obtained from a single
sample, various indications of how far and how frequently the statistics
from successive samples of the same size are likely to vary from the true
values of the parameters in the universe from which the samples are
drawn.

Reliability of an Individual Forecast 

The practical statistician frequently has to deal with a quite different
problem. Having taken a given sample, and having determined from that
sample how the selected dependent variable is related to one or more
independent variables, he then has the problem of drawing new
observations of the same independent variable(s) from the same universe,
and of estimating from those new values the most probable value of the
dependent variable for the new cases. The standard error applicable to
such an estimate is called the standard error of forecast. In ordinary usage,
"forecast" suggests something in the future, like tomorrow's weather, or
next year's corn yield. However, statisticians also use this term in
connection with new observations from universes in which time is not a
source of uncertainty.

For example, in a sample of children drawn at random from the school
population of a given city, certain relations may be determined between
their age and height and their weight. From these relations, how closely
can we expect to estimate the weight of a new child, selected at random
from the same population, once we know its age and height? In problems
such as this, we are concerned with the possible difference between the
estimated value $X_1'$ and the actual value $X_1$, for new observations drawn
from the same universe as the sample. We have calculated standard
errors for the regression coefficient and line, the standard errors in
estimating Y or $X_1$ in the sample under study, and adjustments of the
standard error of estimate for "degrees of freedom" to obtain unbiased
estimates of the variation about the true regression line in the parent
universe. The present problem, however, involves the accuracy of
estimates made from the line or curve obtained from the sample, in the
light of the possible sampling errors of that line, as compared to the true
line, plus the possible range of errors of the estimates around the true
line. What we need, therefore, is a means of combining the standard
error of the regression line with the standard error of estimate.

Simple Regression. For a simple two-variable regression, the square
of the standard error of a single estimate is given by the equation*

$$s_{y'-y}^2 = s_{M_y}^2 + (s_{b_{yx}} x)^2 + \bar{S}_{y.x}^2 \tag{19.1}$$

Applying this equation to the illustration used previously, on page 288,
we can tabulate the calculation of various values as in Table 19.1. Column
(3), values of $s_{y'}^2$, is taken from Table 17.4, next to last column, since

$$s_{y'}^2 = s_{M_y}^2 + (s_{b_{yx}} x)^2$$


Table 19.1

Calculation of $s_{y'-y}$

Selected    Departures
Values      from                                             $s_{y'-y}^2$
of X        Mean, x     $s_{y'}^2$    $\bar{S}_{y.x}^2$      (3) + (4)      $s_{y'-y}$
(1)         (2)         (3)           (4)                    (5)            (6)

0.97        -1.00       14.0650       68.62                  82.6850         9.09
1.47        -0.50        7.1793       68.62                  75.7993         8.71
1.97         0           4.8841       68.62                  73.5041         8.57
2.47         0.50        7.1793       68.62                  75.7993         8.71
2.97         1.00       14.0650       68.62                  82.6850         9.09
3.47         1.50       25.5411       68.62                  94.1610         9.70
3.97         2.00       41.6077       68.62                 110.1977        10.50


* The derivation of equation (19.1) is given in Note 5, Appendix 3.

The last column gives the standard errors of forecast for values of
Y' estimated from new values of X drawn from the same universe. It is
apparent from these values that standard errors for individual forecasts
near the mean of X are but little larger than $\bar{S}_{y.x}$. Thus the standard
error for the forecast of 22.3 for Y' when X = 1.47 is only $s_{y'-y} = 8.71$, as
compared with $\bar{S}_{y.x} = 8.28$. The further the observed value of X departs
from the mean, the larger the uncertainty of the individual forecast.
Thus when X = 3.97, $s_{y'-y} = 10.50$. We can state this uncertainty of the
estimate more simply by expressing the relation as follows:

    When X = 1.47, Y = 22.3 ± 8.71

    When X = 3.97, Y = 64.0 ± 10.50

Here we have introduced a new symbol, Y, to designate the probable
range within which the true value will lie, for 2 estimates out of 3 on the
average.
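
The arithmetic of Table 19.1 can be retraced in a few lines. The sketch below is an illustration under stated assumptions: the variance of the mean (4.8841) and the squared standard error of the regression coefficient (9.1809) are not quoted separately in this section and have been backed out of column (3) of the table, and the names used are hypothetical.

```python
import math

# Components inferred from Table 19.1 (assumptions, not quoted in the text):
VAR_MEAN = 4.8841      # column (3) at x = 0
VAR_SLOPE = 9.1809     # 14.0650 - 4.8841, the increase in column (3) at x = 1
VAR_ESTIMATE = 68.62   # column (4), squared standard error of estimate
MEAN_X = 1.97          # value of X at which the departure x is zero

def forecast_standard_error(new_x):
    """Standard error of an individual forecast at a new value of X."""
    x = new_x - MEAN_X                         # departure from the mean
    var_line = VAR_MEAN + VAR_SLOPE * x ** 2   # sampling variance of the line
    return math.sqrt(var_line + VAR_ESTIMATE)  # add scatter about the line

for new_x in (0.97, 1.47, 1.97, 2.47, 2.97, 3.47, 3.97):
    se = forecast_standard_error(new_x)
    print(f"X = {new_x:.2f}: standard error of forecast = {se:.2f}")
```

The values widen symmetrically on either side of the mean of X, since only the squared departure enters the calculation.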

These standard errors of individual forecasts are interpreted in the
same way as any other standard error, as indicating (for various selected
multiples of the standard error) the proportion of a succession of such
forecasts which can be expected to show departures from the true values of
stated sizes for any specified degree of confidence. Thus, in the problem
illustrated on pages 139 and 289, when yields are estimated for new plots
with 3.97 feet of water applied, 2 out of 3 new observations, on the
average, should show yields falling within 10.5 ten-pound units of the
estimated yield. Table 2.3 should be used in calculating confidence
intervals from this standard error, in the same way as before.
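
The "2 out of 3" statement can be illustrated with the normal curve. The sketch below assumes normally distributed forecast errors and uses the normal distribution in place of Table 2.3, which is not reproduced in this section; for small samples the table's multiples may differ. The function names are hypothetical.

```python
from statistics import NormalDist

def coverage(k):
    """Share of forecasts expected to fall within k standard errors of the
    true value, assuming normally distributed errors."""
    z = NormalDist()
    return z.cdf(k) - z.cdf(-k)

def interval(estimate, standard_error, k=1.0):
    """Range of estimate plus or minus k standard errors."""
    return estimate - k * standard_error, estimate + k * standard_error

print(f"within 1 standard error: {coverage(1):.3f}")
print(f"within 2 standard errors: {coverage(2):.3f}")

# Forecast of 64.0 with standard error 10.50, as in the text.
low, high = interval(64.0, 10.50)
print(low, high)
```

Under the normal assumption about 68 per cent of forecasts fall within one standard error, which is the basis of the rough "2 out of 3" rule.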

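Multiple Regression. The equation for the standard error of an
individual forecast made from a multiple regression equation is similar
to that given for simple correlation, with the addition of expressions for
the additional variables, as follows:

$$s_{x_1'-x_1}^2 = \bar{S}_{1.234}^2 \left[ 1 + \frac{1}{n} + c_{22}x_2^2 + c_{33}x_3^2 + c_{44}x_4^2 + 2c_{23}x_2x_3 + 2c_{24}x_2x_4 + 2c_{34}x_3x_4 \right] \tag{19.2}$$

In this equation $x_2$, $x_3$, and $x_4$ are the values of the independent variables
from which the forecast is made, stated as departures from the respective
means $M_2$, $M_3$, and $M_4$ as calculated in the original sample from which
the regression equation was calculated. The c values for equation (19.2)
are obtained by the simultaneous solution of the following equations:

$$\begin{aligned}
(\Sigma x_2^2)c_{22} + (\Sigma x_2x_3)c_{23} + (\Sigma x_2x_4)c_{24} &= 1 \\
(\Sigma x_2x_3)c_{22} + (\Sigma x_3^2)c_{23} + (\Sigma x_3x_4)c_{24} &= 0 \\
(\Sigma x_2x_4)c_{22} + (\Sigma x_3x_4)c_{23} + (\Sigma x_4^2)c_{24} &= 0
\end{aligned} \tag{19.3}$$

$$\begin{aligned}
(\Sigma x_2^2)c_{32} + (\Sigma x_2x_3)c_{33} + (\Sigma x_2x_4)c_{34} &= 0 \\
(\Sigma x_2x_3)c_{32} + (\Sigma x_3^2)c_{33} + (\Sigma x_3x_4)c_{34} &= 1 \\
(\Sigma x_2x_4)c_{32} + (\Sigma x_3x_4)c_{33} + (\Sigma x_4^2)c_{34} &= 0
\end{aligned} \tag{19.4}$$

$$\begin{aligned}
(\Sigma x_2^2)c_{42} + (\Sigma x_2x_3)c_{43} + (\Sigma x_2x_4)c_{44} &= 0 \\
(\Sigma x_2x_3)c_{42} + (\Sigma x_3^2)c_{43} + (\Sigma x_3x_4)c_{44} &= 0 \\
(\Sigma x_2x_4)c_{42} + (\Sigma x_3x_4)c_{43} + (\Sigma x_4^2)c_{44} &= 1
\end{aligned} \tag{19.5}$$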

Solving these equations, $c_{23} = c_{32}$, $c_{24} = c_{42}$, and $c_{34} = c_{43}$. The
coefficients of the equations to the left of the equality signs are identical in
all three sets of equations and are also identical with those of the normal
equations (11.9), used in determining the net regression coefficients.
This makes it possible to compute the values for the c's at the same time as
for the b's, with only slight additional work. (See Appendix 2, pages 499
to 502, and 522 to 524.)

The c's for a larger number of independent variables are obtained by an
expansion of equations (19.3) to (19.5), setting up as many sets of
simultaneous solutions as there are independent variables and placing the
1 on the right-hand side of the equations opposite the variable whose
$(\Sigma x^2)$ occurs with the $c_{nn}$'s; just as for the second set of equations (19.4)
above, 1 occurs to the right of the equation where $(\Sigma x_3^2)c_{33}$ occurs as one
of the items on the left of the equality sign.

The standard error of the individual forecast will differ for each
combination of values of the various independent variables. If these all
fall about their means, $s_{x_1'-x_1}$ will be only slightly larger than
$\bar{S}_{1.234}$. If one or more fall far from it, the standard error of the
forecast will be correspondingly large.

For n variables, the general formula for the square of the standard
error of the individual estimate is given symbolically by

$$s_{x_1'-x_1}^2 = \bar{S}_{1.23 \ldots n}^2 \left[ 1 + \frac{1}{n} + (c_2x_2 + c_3x_3 + \cdots + c_nx_n)^2 \right] \tag{19.6}$$

In expanding equation (19.6) for any number of variables, it must be
interpreted by the special condition that $c_2c_3 = c_{23} = c_{32}$, etc.

The standard errors of individual estimates made from multiple
regression equations, according to equations (19.2) or (19.6), can be
interpreted in the same way as those from simple regression equations.
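
The computation of the c values and of the forecast variance can be sketched for a small case. The code below is an illustration under stated assumptions, not the hand procedure of Appendix 2: it obtains the c's all at once by inverting the table of sums of squares and products of the departures, which is equivalent to solving the sets of simultaneous equations one by one. The two independent variables, their values, the squared standard error of estimate of 4.0, and all function names are hypothetical.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Bring the largest available pivot into position.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def c_values(columns):
    """Means of the independent variables and the matrix of c values."""
    means = [sum(col) / len(col) for col in columns]
    devs = [[v - m for v in col] for col, m in zip(columns, means)]
    moments = [[sum(a * b for a, b in zip(di, dj)) for dj in devs]
               for di in devs]
    return means, invert(moments)

def forecast_variance(s2_estimate, n, c, departures):
    """Squared standard error of an individual forecast, as in (19.2)."""
    k = len(departures)
    quad = sum(c[i][j] * departures[i] * departures[j]
               for i in range(k) for j in range(k))
    return s2_estimate * (1 + 1 / n + quad)

# Hypothetical sample values of two independent variables.
x2 = [1, 2, 3, 4, 5, 6]
x3 = [2, 1, 4, 3, 6, 5]
means, c = c_values([x2, x3])
s2 = 4.0  # hypothetical squared standard error of estimate

at_means = forecast_variance(s2, len(x2), c, [0.0, 0.0])
new_obs = [6.5, 0.5]  # an unusual combination of the two variables
far_out = forecast_variance(
    s2, len(x2), c, [v - m for v, m in zip(new_obs, means)])
print(at_means, far_out)
```

At the means the forecast variance exceeds the squared standard error of estimate only by the factor 1 + 1/n, while the unusual combination of values gives a much larger figure, in line with the statement above.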

Curvilinear Regression. Where a simple or multiple curvilinear
relation is determined by fitting mathematical regression equations, the
standard error of individual estimates can be computed by an extension
of equation (19.2). Thus if a cubic parabola has been fitted using

$$Y = a + b_2X + b_3X^2 + b_4X^3$$

we can compute this equation most readily by writing it in the form

$$Y = a + b_2X + b_3U + b_4V$$

where $U = X^2$ and $V = X^3$.

The standard error of an individual estimate is then given by the
equation

$$s_{y'-y}^2 = \bar{S}_{y.xuv}^2 \left[ 1 + \frac{1}{n} + c_{22}x^2 + c_{33}u^2 + c_{44}v^2 + 2c_{23}xu + 2c_{24}xv + 2c_{34}uv \right]$$

Similar expansions are available for mathematical regression equations
for two or more variables.*

The Applicability of a Regression Equation to an Extrapolation 
beyond the Observed Range 

We have already seen examples, in Chapters 14 and 16, of how estimates
might sometimes need to be made for new observations which lie beyond
the range included in the original sample. We have also seen the possibility
of exceptionally large errors of estimate when the formulas or curves are
extrapolated in this way beyond the observed range. A rough rule-of-thumb
has been given that estimates beyond the observed range should
never be made, or, if they must be made, should be regarded as
exceptionally hazardous. This present section will explore further the
meaning of the statement "beyond the range of observation."

Where only two variables are concerned, there is no question as to
the range covered in the original observations. Thus if we consider the
data plotted in Figure 8.2, it is apparent at once that the independent
variable X covers the range from 1.2 to 3.5. Any new values of X smaller
or larger than those values would be beyond the observed range.

Where two or more independent variables are concerned, the situation
is more complex. Thus the data of the example plotted in Figures 10.4
and 10.5 show that the acres range from 60 to 240, and the cows range
from 0 to 18. Suppose a new observation were drawn from the same
universe, with 225 acres and 17 cows. Would that observation be within
the original range? At first it might seem that it would, since the number
of acres falls within the original acreage range, and the number of cows
within the original range for cows.

* Henry Schultz, The standard error of a forecast from a curve, Journal of the
American Statistical Association, pp. 139-185, Vol. XXV, June, 1930.

Multiple regression, however, is concerned not merely with the relation
of the dependent variable to each independent variable separately, but
with the composite relation to all the independent variables together. Is
the combination of 17 cows and 225 acres either exactly or approximately
within the joint distribution of the original observations? The joint values
for $X_2$ and $X_3$, which were represented in the original observations, are
shown plotted on Figure 10.6. From this it is evident that the new
combination lies well outside the observed joint distribution.

The original sample had some farms of between 200 and 250 acres,
but none of them had more than 6 cows. It also had some farms of 15 or
more cows, but none of them had more than 120 acres. The single original
case that came anywhere near the new observation was a farm with 14
cows and 180 acres. Since the new observation lies well outside the joint
distribution or combination of values represented in the original sample,
any estimate made for it from a regression equation based on that sample
is subject to a hazard beyond that given by the error formulas discussed
earlier in this chapter. Those formulas give accurate values of the probable
error of individual estimates only within the range represented by the
original sample. Extrapolation of the regression equation or curves
beyond that range, or combination of values, represents an extension into
unknown fields, where sudden changes in the nature of the relations might
conceivably occur. A priori knowledge of the relations, based on technical
facts and theories, or on other evidence, may justify extrapolations of
the curves. Estimates of error for such extrapolations are only as reliable
as the assumptions on which the extrapolations are based.

Where there are three or more independent variables, it is still more
difficult to determine whether a given new combination of values lies
outside the joint distribution of the three or more variables in the original
sample. In many cases this can be determined by careful checking of
the new observation against dot charts of the correlations among the
independent variables. Thus, suppose a new observation were drawn with
2 cows, 100 acres, and 4 men. Would this be within the range of the
original observations?

Careful inspection of the observations tabulated in Table 11.3, on page
177, reveals that, although the combination of 2 cows and 100 acres
is well within the observed joint distribution for those two variables, no
such combination occurred with 4 men, or even with 3 men. The nearest
values are one observation (No. 7) of 3 men with 6 cows and 170 acres
and one other (No. 12) of 3 men with 15 cows and 120 acres. The new
observation, of 4 men with 2 cows and 100 acres, would apparently
involve much more human labor, to care for that many cows and acres,
than was represented in the original observations, and therefore lies far
outside the joint distribution represented in the sample for the three
variables. It is quite possible that that much labor would represent a
wasteful use, so that the additional men would be as likely to reduce the
farm income as to increase it. An estimate of income for this new farm,
based on the relations shown in the sample for quite different farms, might
therefore be very sadly in error.

The rough process of comparing the new observation with the values of
the independent variable for the original observations, as illustrated above,
may serve reasonably well for determining whether the new observation
is or is not represented in the original sample. Mathematical methods are
available for estimating the probability that an observation drawn at
random from the universe represented by the original sample will deviate
from the "central tendency" of its joint distribution farther than any
specified combination of values of the independent variables (note article
by Waugh and Been). Carrying through such calculations ordinarily
would involve an amount of labor out of proportion to the value of the
information obtained. For very exact work, or for estimates of very
great importance, however, it might be worth working them out. This
would be true especially where the new observation happened to fall at
about the edge of the distribution zone of the previous observations, so
that it was uncertain whether or not it would be safe to estimate the
dependent variable from the relations previously observed.

When forecasts are being made from regression equations for time
series, even more complicated issues are involved. These are considered
in Chapter 20.
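
The check against the joint distribution described in this section can be made roughly by machine. The sketch below is an illustration only: it uses a distance of the Mahalanobis type, which is one common way of measuring how far a combination of values lies from the center of a joint distribution, and it is not claimed to be the method of Waugh and Been. The eight farms are hypothetical values, not the observations of Figure 10.6, and the function name is an assumption.

```python
def joint_distance(sample, point):
    """Squared distance of `point` from the sample means, scaled by the
    sample variances and covariance (two variables only)."""
    n = len(sample)
    ma = sum(a for a, _ in sample) / n
    mc = sum(c for _, c in sample) / n
    saa = sum((a - ma) ** 2 for a, _ in sample)
    scc = sum((c - mc) ** 2 for _, c in sample)
    sac = sum((a - ma) * (c - mc) for a, c in sample)
    det = saa * scc - sac ** 2
    da, dc = point[0] - ma, point[1] - mc
    return (n - 1) * (scc * da * da - 2 * sac * da * dc + saa * dc * dc) / det

# Hypothetical farms as (acres, cows); larger farms tend to keep fewer cows.
farms = [(60, 18), (100, 15), (120, 12), (160, 9),
         (200, 6), (240, 2), (180, 14), (140, 5)]
new_farm = (225, 17)

# Check each variable separately against its own observed range.
within_each_range = all(
    min(f[i] for f in farms) <= new_farm[i] <= max(f[i] for f in farms)
    for i in range(2)
)
# Check the combination against the joint distribution.
largest_in_sample = max(joint_distance(farms, f) for f in farms)
new_distance = joint_distance(farms, new_farm)

print("within each separate range:", within_each_range)
print("largest distance in sample:", round(largest_in_sample, 2))
print("distance of new observation:", round(new_distance, 2))
```

In this constructed case the new farm passes the test on each variable separately, yet its joint distance is several times larger than that of any farm in the sample, which marks an estimate for it as an extrapolation.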

REFERENCES 

Waugh, Frederick V., and Richard O. Been, Some observations about the validity of
multiple regressions, Stat. Jour., College of the City of New York, Vol. I, No. 1,
pp. 6-14, January, 1939.

Schultz, Henry, The standard error of a forecast from a curve, Jour. Amer. Stat. Assoc.,
pp. 139-185, No. 170, Vol. XXV, June, 1930.

Armore, Sidney J., and Edgar L. Burtis, Factors affecting consumption of fats and oils
other than butter in the United States, U.S. Dept. of Agr., Agr. Econ. Res., Vol. II,
No. 1, pp. 7-9, January, 1950.



CHAPTER 20 


The use of error formulas 
with time series 


This chapter is concerned with universes which are extended over time
and in which both the original observations and subsequent new ones
are measured at successive intervals of time. The corn yield example of
Chapter 14 and the steel cost example of Chapter 16 both used time-series
data.

Many problems important in economics and other social sciences
involve measurements in time. But time-series problems also arise in the
biological and physical sciences. In recent years statisticians have
recognized the similarities between time-series problems in the various
disciplines and have made considerable progress toward their solution.

Differences Between Time-Series and Other Types of Data. The
attitudes of research workers toward regression analysis of time series
have varied between widely separated extremes. In the early and middle
1920’s many researchers were completely unaware of problems connected
with the sampling significance of time series. Then, under the (partly
misinterpreted) influence of articles such as Yule’s on “nonsense-correlations,”
it became fashionable to maintain that error formulas simply
did not apply to time series.¹ There was some implication that reputable
statisticians should leave time series alone. But in some fields a large
amount of data already existed in the form of time series; and the
variables so recorded were frequently important in the theories of the

¹ Yule himself did not adopt a wholly negative attitude, but spelled out certain
situations in which correlations between time series would be highly misleading. These
did not preclude the existence of other situations in which correlations between time
series would have the usual sampling significance.

G. U. Yule, Why do we sometimes get nonsense-correlations between time-series? A
study in sampling and the nature of time-series, Journal of the Royal Statistical Society,
Vol. 89, No. 1, pp. 1-64, 1926.




respective fields. Further, controlled experiments involving these variables
would be impossible, or expensive and time-consuming. During the
1930’s, therefore, some research workers continued to apply regression
methods to time series, but with considerable trepidation; the previous
editions of this book defended the practice, partly on intuitive grounds.

All the error formulas presented in Chapters 17 and 19 assume that the
sample has been drawn from a clearly defined universe. If the observations
are completely random drawings from a bivariate or multivariate normal
universe (the “correlation model”), all of the simple, partial, and multiple
correlation coefficients, the simple and partial regression coefficients, and
the standard error of estimate calculated from the observations may be
regarded as estimates of corresponding parameters of that universe. The
regression and correlation coefficients from successive random samples
would be distributed according to the error formulas given in Chapter 17.

If the set of observations results from a controlled experiment, error
formulas apply strictly only to a universe in which the distribution of the
values of the independent variables remains fixed in successive samples
(the “regression model”). Values of the dependent variable for each
given combination of values of the independents are distributed randomly
in successive samples about an “expected” or universe value, a point on
the universe regression curve or surface. In Chapter 18 it was demonstrated
that the correlation coefficient is strongly affected by purposeful selection
of observations with respect to values of the independent variable. It can
be shown readily that error formulas are also affected by such selection.
For, from equation (17.1),

$$s_b = \frac{\bar{S}_{y \cdot x}}{\sqrt{\sum x^2}}$$

If we select extreme values of x (which are deviations about a sample
mean), $\sum x^2$ increases and $s_b$ therefore decreases. If we select a narrow range
of values of x, $\sum x^2$ decreases and $s_b$ therefore increases; i.e., the sample
regression coefficient becomes a less reliable estimate of the universe
value. (We assume here a universe in which $y = \beta x + z$,² where z is
randomly distributed with zero mean and variance $\sigma^2$. The
adjusted standard error of estimate in any sample should then be an
unbiased estimate of $\sigma$, regardless of the manner in which x values are
selected; similarly, the regression coefficient from any sample should be
an unbiased estimate of $\beta$.)
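
The effect of selecting x values on the reliability of b can be illustrated with a small sketch. It assumes the familiar form in which the standard error of b is the universe standard deviation of z divided by the square root of the sum of squared deviations of x; the function name, the value of sigma, and both sets of x values are hypothetical.

```python
import math

def slope_standard_error(x, sigma):
    """Standard error of b: sigma divided by the square root of the
    sum of squared deviations of x about its sample mean."""
    mean_x = sum(x) / len(x)
    sum_sq_dev = sum((xi - mean_x) ** 2 for xi in x)
    return sigma / math.sqrt(sum_sq_dev)

sigma = 2.0                       # hypothetical standard deviation of z
wide_x = [0, 0, 0, 10, 10, 10]    # extreme values of x selected
narrow_x = [4, 4, 5, 5, 6, 6]     # narrow range of x selected

se_wide = slope_standard_error(wide_x, sigma)
se_narrow = slope_standard_error(narrow_x, sigma)
print(f"wide range:   s_b = {se_wide:.3f}")
print(f"narrow range: s_b = {se_narrow:.3f}")
```

With the same number of observations and the same sigma, the narrow selection yields a much larger standard error than the wide one, which is the point made in the text.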

The concern of applied research workers about time-series analysis in
the early 1930’s was based upon departures from the “correlation model”
only. This model assumes that each observation in a sample is selected

² In this chapter $\beta$ is used in the sense of the true universe value of b.





purely at random from all the items in the original universe, so that a
below-average value is just as likely to be followed by a high one as by
another low one. If the successive months or years in a time series are
regarded as successive observations, this assumption obviously may not
hold true. For example, each successive item of a linear trend line is
perfectly correlated with each preceding item. Each price of a given
commodity on succeeding days or months may show some relationship to
prices in the preceding period.

If the correlation between each item of a series and the item of the same
series next following it in time is calculated by the usual methods, the
resulting correlation coefficient is termed the coefficient of autocorrelation.
Many time series show autocorrelations that differ significantly from
zero, meaning that the basic conditions of simple sampling have probably
not been met. This situation casts doubt upon the stability of correlation
coefficients between two such series from one period to another, and hence
upon the applicability of standard errors of correlation coefficients.
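
As a minimal sketch of this definition, the following computes the ordinary correlation between each item of a series and the item next following it. The function name and both example series are hypothetical; the linear trend illustrates the perfect correlation between successive items mentioned above.

```python
import math

def lag1_autocorrelation(series):
    """Ordinary (Pearson) correlation between each item and the item
    of the same series next following it in time."""
    current, following = series[:-1], series[1:]
    n = len(current)
    mean_c = sum(current) / n
    mean_f = sum(following) / n
    cov = sum((c - mean_c) * (f - mean_f) for c, f in zip(current, following))
    var_c = sum((c - mean_c) ** 2 for c in current)
    var_f = sum((f - mean_f) ** 2 for f in following)
    return cov / math.sqrt(var_c * var_f)

trend = [10.0 + 0.5 * t for t in range(20)]   # hypothetical linear trend line
alternating = [1.0, -1.0] * 10                # hypothetical series changing sign each period

r_trend = lag1_autocorrelation(trend)
r_alternating = lag1_autocorrelation(alternating)
print(round(r_trend, 6), round(r_alternating, 6))
```

A series of independent random drawings would be expected to give a value near zero, lying between these two extremes.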

In addition to this technical fact (which applies to some time series
but not to others), there was a broad philosophical objection to the idea
that any sequence of time-series observations was in fact a sample from
a definite, unchanging universe. When any phenomenon is sampled at
successive intervals of time the universe being studied can never be
precisely the same. Successive astronomical observations might differ,
even if in imperceptible degrees, because of the loss of matter through
radiation of energy from the various stars. Surveying measurements in
successive years might differ because of slight geological shifting of the
earth’s surface, or because of erosion or other changes in the soil surface.
Normal crop yields change because of improvements in the genetic
make-up of the seed so that what would be normal yields for certain
weather at one time become subnormal yields at another. Human
populations, too, change constantly from dietary, genetic, and other
causes so that the average relationship between height and weight may
change with time. Industrial techniques change, tending to modify
relationships, as in the steel-cost example; so do the wants of consumers
and the types and qualities of goods.

Some of the possible changes in universes mentioned above will be
trivial for most practical purposes. Among these are changes in astronomical
observations and surveying measurements, where the real changes in
the universe over a limited span of years might be small relative to the
precision with which our instruments are able to measure them, and
completely negligible in their effects upon regression coefficients between
different variables in the universes. In general, universes subject to
human intervention change more rapidly than those completely dependent




upon natural processes. Even in universes involving human behavior or
purposeful intervention, our knowledge of the subject matter involved
may enable us to set reasonable limits upon the probable extent of changes
in the universe over a given span of years.

To the extent that changes in the universe follow a steady rate of
progression, they can sometimes be allowed for by trend factors (as shown
in Chapter 14), by seasonal factors, simultaneously determined, or by
progressive shifts in the regressions themselves. In such cases, forecasts
of future changes depend upon a continuation of the same rate or degree
of change. Where human intervention is possible, however, one can
never be completely sure that a new event may not make a sudden change
or break in the trend.

The sampling significance of regression analyses based on time series
was clarified by Koopmans, Wold, and others in the late 1930’s.³ Clarification
was achieved by shifting emphasis from the correlation to the
regression model. Suppose that two variables, such as the supply and
price of beef, are logically related by the equation $P = \alpha + \beta S + z$,
where z reflects the random influences of other economic factors. Then
if for any reason the supply of beef follows a cyclical pattern over time,
the price of beef will trace out a similar cycle with the relative amplitude
$\beta$. Neither the supply nor the price observations will be random over
time. But we can regard the successive values of S in the same light in
which we regard the values of any independent variable in a controlled
experiment. For example, we could have selected our samples of wheat in
Chapter 6 in a specified “time” order, starting with those having high
percentages of vitreous kernels and working down to those having low
percentages of vitreous kernels. The percentages of such kernels in
successive samples would show very high autocorrelation over “time,” but
this would not shake our confidence in the resulting estimates of the
relationship between protein content and vitreous kernels.
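
The argument can be illustrated by a small simulation, offered only as a sketch under stated assumptions. The parameter values, the cycle length, the number of observations, and the normal distribution of z are all hypothetical choices made for illustration; the supply series is strongly autocorrelated, yet the fitted slope recovers the universe value and the residuals show little autocorrelation.

```python
import math
import random

def fit_line(x, y):
    """Least-squares intercept and slope of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def lag1_correlation(series):
    """Correlation between each item and the item next following it."""
    a, b = series[:-1], series[1:]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

random.seed(0)
ALPHA, BETA = 100.0, -2.0   # hypothetical universe parameters
# Supply follows a smooth cycle, so it is far from random over time.
supply = [50.0 + 10.0 * math.sin(2 * math.pi * t / 20) for t in range(200)]
# Price follows P = alpha + beta*S + z, with z drawn independently each period.
price = [ALPHA + BETA * s + random.gauss(0.0, 1.0) for s in supply]

a_hat, b_hat = fit_line(supply, price)
residuals = [p - (a_hat + b_hat * s) for s, p in zip(supply, price)]

r_supply = lag1_correlation(supply)
r_residuals = lag1_correlation(residuals)
print(f"autocorrelation of supply:    {r_supply:.3f}")
print(f"estimated slope:              {b_hat:.3f}")
print(f"autocorrelation of residuals: {r_residuals:.3f}")
```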

The introduction of time into the wheat-protein example seems artificial
and irrelevant, whereas the time sequence of observations on beef supplies
seems natural and inescapable. However, if the parameters $\alpha$ and $\beta$
remain constant over the entire period, the ordering of S over time is
also irrelevant from a statistical viewpoint. The only requirement is that
the residuals z be random with respect to time. If all of our observations
are drawn from a universe defined by the equation $P = \alpha + \beta S + z$,
time, as Wold puts it, “plays the secondary role of a passive medium.”⁴
³ Tjalling C. Koopmans, Linear Regression Analysis of Economic Time Series,
Haarlem, De Erven F. Bohn N.V., 150 pp., 1937.

⁴ Herman Wold, A Study in the Analysis of Stationary Time Series, p. 1, Almqvist
and Wiksells, Uppsala, 1938.





This would be true for equations in three or more variables if the partial
regression lines or curves in the universe remained constant over the
specified period.

The above argument does not, of course, mean that a novice in any
field can go ahead and discover new truths by haphazardly regressing
time series on one another. Instead, the researcher must consider carefully
whether the series in question are logically related, and whether these
logical relationships apply also to the trends and seasonals (if any); also,
he must apply tests of autocorrelation to the residuals from the regression
equation fitted.

The approach to sampling errors in terms of the regression model
should dispel any mystical dread of time series as “somehow different”
and substitute the more manageable questions: (1) Did the universe
from which these observations were drawn remain sufficiently stable over
the period to which they pertain; and (2) Is the universe now sufficiently
like that of the previous period that regression relationships based on that
period still apply? The latter question is equally relevant to data from
sample surveys or from controlled experiments.

For example, suppose we had drawn, in 1890, a random sample of
adult males in the United States and determined their average height,
average weight, and the regression of weight upon height. Inferences
from the sample relationships to those in the universe would apply
strictly only to the population in 1890. However, applications based on
the 1890 sample would doubtless be made to the actual population
existing in 1891 and later years, perhaps by clothing manufacturers
who wanted to estimate appropriate proportions of the different sizes of
ready-made garments. Assuming that a fair amount of time and expense
were involved in repeating the 1890 sample, the decision as to how soon a
new sample was needed would be a matter for professional judgment: the
judgment of human biologists as to the probable extent of the effects of
immigration of different groups since 1890, the effects of changes in diet,
and perhaps the effects of changing attitudes toward obesity. The expense
of rechecking the relationships for 1895 or 1900 would presumably be
weighed against the possible scientific and/or economic benefits to be
gained from any changes shown by the new analysis.

Relationships estimated from controlled experiments may also lose
their relevance with the passage of time. Let us suppose that the relationship
between cotton yields and irrigation water applied (Chapter 8) had
been determined by an elaborate controlled experiment which sufficed
to determine the regression relationship as of 1927 with a high degree of
accuracy. Let us say that this experiment was applied to a particular
strain of cotton and that no fertilizers were used. If the typical farming




practices in 1927 were to use that particular strain of cotton and no
fertilizer, the results of this experiment would be immediately useful to
farmers in 1928. But as the varieties of seed in common use changed and
levels of fertilizer use increased, agronomists in the area would have to
decide on a judgment basis whether the 1927 experimental results were
still applicable, and the direction and extent to which they should be
modified in determining optimum irrigation rates. Sooner or later other
controlled experiments would be needed to determine the net relation
between cotton yields and applications of irrigation water with new cotton
varieties, more fertilizer, and other changes in cultural practices. Perhaps
joint functions would be needed, taking water and fertilizer applications
into account simultaneously with the new varieties.

Hence, in any area subject directly or indirectly to human influence,
no method of selecting observations will guarantee that the relations
estimated in one time period will apply to the corresponding universes
at some later date. If controlled experiments must be repeated from time
to time as the relevant universe changes, it is reasonable to assume that
time-series analyses also should be repeated occasionally using additional
observations or, when sufficient time has elapsed, completely new observations.
If the actual corn yield in 1957 is 20 bushels higher than that
estimated from the 1890-1927 relationship, the discrepancy should not
be charged to the fact that the relationship had been estimated from time
series, but rather to the fact that the whole set of observations is out of date.

There are various ways in which time-series analyses can be tested
and brought up to date. As new observations accumulate, we can use
statistical tests to determine whether the residuals are “out of line” with
the earlier relationship. For example, in extrapolating the corn-yield
equation from 1928 through 1939 (Table 20.1) the sequence of five
consecutive positive residuals from 1930 through 1934 would have
furnished some grounds for suspecting that the estimates were falling
“significantly” below the new level of actual yields. If we use the analogy
of tossing a coin, regarding a positive residual as heads and a negative
residual as tails and remembering that the residuals should be randomly
distributed over time if the regression equation is to remain applicable,
then the probability of getting five successive positive residuals might be
estimated⁵ at $(\tfrac{1}{2})^5$, which equals $\tfrac{1}{32}$, or 0.03125. By 1934, then, we might
have said that there was less than 1 chance in 20 that the former relationship
was still completely applicable. The 1935 residual, again positive,

⁵ The statements in this paragraph are illustrative rather than rigorous. For an
introduction to more precise methods of estimating the probability of “runs” or
sequences of different lengths, see A. M. Mood, The distribution theory of runs, Annals
of Mathematical Statistics, Vol. XI, No. 4, especially pp. 367-368, December, 1940.





would have reduced the odds to about 1 in 50 or less. But the damage
done by adhering to estimates from the 1890-1927 relationship would
still have been slight, as only one of these six positive deviations was
larger than the standard error of estimate and this one did not exceed two
standard errors of estimate. The standard error of forecast, which is the
theoretically appropriate basis for comparison, would give a still larger
error zone and lead to a still more tolerant appraisal of the importance of
the observed deviations.
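
The coin-tossing arithmetic can be checked with a few lines. This is only the illustrative calculation described in the text, not a rigorous test of runs; the function name is hypothetical, and each residual is assumed to be positive or negative with equal probability, independently of the others.

```python
from fractions import Fraction

def run_probability(length):
    """Chance that `length` independent residuals are all positive,
    treating each sign like the toss of a fair coin."""
    return Fraction(1, 2) ** length

p_five = run_probability(5)   # residuals of 1930 through 1934
p_six = run_probability(6)    # adding the positive residual of 1935
print(p_five, float(p_five))
print(p_six, float(p_six))
```

Five successive positive residuals give 1/32, which is below 1 chance in 20; six give 1/64, which is below 1 chance in 50.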


Table 20.1

Corn Yields Estimated from 1890-1927 Regression Relationships,
for the Years 1928 to 1939

                  Corn Yields
Year     Estimated*     Actual     Residual (actual minus estimated)

1928        33.1         33.4          0.3
1929        33.7         31.5         -2.2
1930        24.3         25.8          1.5
1931        31.7         32.7          1.0
1932        32.7         35.4          2.7
1933        24.0         29.4          5.4
1934        17.4         18.9          1.5
1935        30.6         31.7          1.1
1936        10.6         18.5          7.9
1937        30.6         36.4          5.8
1938        31.4         36.5          5.1
1939        31.9         41.3          9.4

* Based on regression curves shown in Figure 14.10.
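
The comparisons drawn from Table 20.1 can be retraced with a short sketch. The estimated and actual yields are read from the table and the standard error of estimate (2.8 bushels) is taken from the text; each residual is recomputed as actual minus estimated, and the use of the normal distribution to judge the 1936 residual is an assumption of this sketch rather than a method stated in the text.

```python
import math

STANDARD_ERROR = 2.8  # standard error of estimate in bushels, from the text

# (year, estimated yield, actual yield) as read from Table 20.1
table = [
    (1928, 33.1, 33.4), (1929, 33.7, 31.5), (1930, 24.3, 25.8),
    (1931, 31.7, 32.7), (1932, 32.7, 35.4), (1933, 24.0, 29.4),
    (1934, 17.4, 18.9), (1935, 30.6, 31.7), (1936, 10.6, 18.5),
    (1937, 30.6, 36.4), (1938, 31.4, 36.5), (1939, 31.9, 41.3),
]

# Residual = actual minus estimated, rounded to the table's one decimal place.
residuals = {year: round(actual - estimated, 1) for year, estimated, actual in table}

# Years 1930-1935 whose residual exceeds one standard error of estimate.
large_1930_35 = [y for y in range(1930, 1936) if abs(residuals[y]) > STANDARD_ERROR]

# Two-sided normal probability of a residual at least as large as that of 1936
# (normality is assumed here for illustration only).
z_1936 = residuals[1936] / STANDARD_ERROR
p_1936 = math.erfc(z_1936 / math.sqrt(2))

print(residuals)
print("exceeding one standard error, 1930-1935:", large_1930_35)
print(f"1936 residual in standard-error units: {z_1936:.2f}, probability {p_1936:.4f}")
```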




of obtaining a single residual as large as that of 1936 from the 1890-1927
universe of residuals is less than 0.01. At this point, the evidence would be
sufficient to justify reworking the entire analysis from 1890 through 1936,
and perhaps checking it by means of a separate analysis for (say) 1910-1936.
The principal adjustment as of 1936 would probably be an increase in the
level of the time trend for recent years and an upward slope beginning in
the early or middle 1930’s. This would contrast with the slight downward
slope of the time trend in Figure 14.10 during the 1920-1927 period.

Of course, the reliability of the new analysis would be limited by the
standard errors of the constants of the regression curves just as was the
old one. The standard error of estimate, 2.8 bushels, indicates that at
best we cannot hope to forecast yields on the basis of rainfall, temperature,
and trend within less than 2.8 bushels in 2 years out of 3 and that occasionally
we must expect our estimates to be off by 5 bushels or more even with
no change in the basic relationships.⁶
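
The phrases “2 years out of 3” and “occasionally” can be given rough numerical meaning. The sketch below assumes normally distributed errors with a standard deviation equal to the standard error of estimate; that assumption and the function name are introduced here for illustration, and the somewhat larger error zone of an individual forecast is ignored.

```python
import math

def prob_within(limit, standard_error):
    """Probability that a normally distributed error lies within
    plus or minus `limit` of zero."""
    return math.erf(limit / (standard_error * math.sqrt(2)))

SE = 2.8  # standard error of estimate in bushels, from the text

p_within_one_se = prob_within(SE, SE)    # errors no larger than 2.8 bushels
p_off_by_5 = 1 - prob_within(5.0, SE)    # errors of 5 bushels or more
print(f"within 2.8 bushels:     {p_within_one_se:.3f}")
print(f"off by 5 bushels or more: {p_off_by_5:.3f}")
```

Under these assumptions about two errors in three fall within one standard error, and an error of 5 bushels or more would be expected somewhat less often than one year in ten.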

Presumably, a research worker interested in the corn-yield problem
would also be familiar with the results of controlled experiments on new
varieties, effects of fertilizer, and any other scientifically based information
that might suggest whether, why, and in what direction the net regression
curves and the residual trend in yield might be changing. This new
information could be converted into new hypotheses as to the shapes of
the net regression curves and time trend, which could then be tested
against the data for recent years. In the latter part of Chapter 14 we
discussed such information as an explanation of the trend in corn yields
from 1928 through 1956.

Thus, the analysis of time series as a research tool is not distinctly
different from the use of controlled experiments or random samples from
a universe existing at some particular point in time. Even controlled
experiments can be abused and misinterpreted by inept workers, and some
aspects of the relevant universes may not be subject to experimental
control by even the best scientists. The design of an experiment must
take into account the probable sources of disturbances which might
interfere with measuring the relationships involved. Otherwise, the
estimated relationship between Y and X may be biased by the fact that
another variable, Z, just happened to be associated with the values of X in
the experiment.

For example, in an experiment to determine the effects of different
levels of fertilizer application upon corn yields, an inept experimenter
might design the study in such a way that the larger quantities of fertilizer
happened to be applied to plots with higher than average basic soil

⁶ As pointed out in Chapter 19, the error zone applicable to an individual forecast is
somewhat larger than the standard error of estimate.




fertility or which had carried over larger residual quantities of chemical
fertilizers from experiments made in the previous year. As an illustration
of the problems encountered by top scientists, the designers of the elaborate
fertilizer experiments mentioned in Chapter 23 made the following
comment:

The predictions apply to particular soils in a particular year; production
surfaces obtained under other rainfall and soil conditions can be expected to
differ from those obtained in the experiments reported. Traditional experimental
procedures (wherein a few rates of one or more nutrients are applied)
also refer to the rainfall, climatic, insect and crop conditions of the particular
year.⁷

Some economic analyses have proved inadequate because technical
problems, such as the effects of intercorrelation and the meaning of
autocorrelation in the residuals, were not recognized. Economic statisticians
could be excused for making such mistakes in the 1920’s and 1930’s;
in the 1960’s they can reasonably be expected to take these problems into
account.


The Nature of “Randomness” in the Residuals from Time-Series
Regressions. Before discussing practical tests of autocorrelation, we
should perhaps say a few words more about the reasons for expecting and
requiring random residuals in connection with a “successful” time-series
regression analysis.

In part these reasons are the same as for regression analysis in general,
which have been discussed implicitly and explicitly in a number of
connections from Chapter 4 on. We are interested in estimating the
“true” or “universe” relationship between a dependent variable (Y) and
one or more independent variables. The basic assumption in least-squares
regression analysis is that a systematic relationship of the following type
exists in the universe: For every possible combination of values of the
independent variables, there exists an expected value
$E(Y \mid X_{1i}, X_{2i}, \ldots, X_{ni})$
which may be regarded as the arithmetic mean of all possible values of Y
given that particular set (in our notation the ith set) of values of the
independent variables. The individual values of Y for the given combination
of X’s will show only random deviations about the expected value.
Most applications of least-squares regression analysis assume that the
variance of the individual Y’s (in the universe) about the relevant expected
value is the same for all combinations of values of the X’s. While the
assumptions of uniform variance can be relaxed in certain specific ways,⁸


⁷ Earl O. Heady, John T. Pesek, and William G. Brown, Crop response surfaces and
economic optima in fertilizer use, p. 325, Iowa Agricultural Experiment Station Research
Bulletin 424, March 1955.

⁸ See Hald, Anders, Statistical Theory with Engineering Applications, pp. 52f.,
551-52, and 627, John Wiley and Sons, New York, 1952.




the assumption of randomness of residuals about each segment of the
regression surface cannot be modified.

If we fit a straight line to a set of observations in which the underlying
relationship is really parabolic, we find that the line of averages of Y for
different ranges of X departs markedly from the straight line and it does
so in a systematic manner. Statistical tests as to whether a relationship is
curvilinear are based upon the departure of such group averages from the
straight line (see Chapter 23, page 405) relative to the departure that might
be expected as a result only of the random forces responsible for variations
of individual observations about each group average.

In the corn yield example of Chapter 14, “time” was introduced as an
explicit variable (A'j) in the regression function. This would be expected
to go a long way toward randomizing the residuals with respect to time.
If “time” had been left out of the analysis, and if the partial regression
curves of yields on rainfall and temperature had come out as shown in
Figure 14.10, the residuals might very well have shown significant
autocorrelation, for they would still have contained the systematic “time”
effects represented in the bottom section of that figure.

In many analyses involving time-series observations, it seems plausible
that “time” may give rise to a special type of difficulty. In the regression
model, two or more successive residuals may be influenced by common
factors rather than by completely independent (random) ones. If the
basic variables are essentially continuous over time, then in going from a
high level (say) of temperature to a lower one we pass through all the
intermediate values as well. The closer together in time we take our
successive temperature readings the greater will be the autocorrelation
between them (i.e., the correlation of each item in the series with its
succeeding item). We have noted that such autocorrelation in the basic
series does not invalidate error formulas. But if high autocorrelations do
exist in the basic series it seems plausible that, for sufficiently short time
lags, there could also be significant autocorrelation in the residuals.

Adjusting for Autocorrelation. Some of the earlier negativism
concerning time series still surrounds the question, “What do you do if
there is significant autocorrelation in the residuals?” From a good many
texts and technical articles the main or only impression obtained is a
negative one: “in this case the usual error formulas do not apply.”
Constructive answers, at least approximate ones, can be found in the
literature as far back as 1935.⁹

The main point is that autocorrelated series give us less information
per observation than do completely random ones. In connection with the

⁹ See M. S. Bartlett, Some aspects of the time-correlation problem in regard to tests of
significance, Journal of the Royal Statistical Society, Vol. XCVIII, pp. 536-543, 1935.




correlation model, Bartlett developed an approximate correction formula
to deal with this fact. For example, we find that the temperature series
of Chapter 14, during the period 1928-1956, is significantly autocorrelated
at the 5-per cent probability level and autocorrelation in the rainfall
series is just short of significance at the same level. The two coefficients
are 0.368 and 0.231 respectively. According to Bartlett, this means that
the standard error of the correlation coefficient between the two series is
increased in approximately the ratio

$$\sqrt{\frac{1 + (.368)(.231)}{1 - (.368)(.231)}} = \sqrt{\frac{1.085}{0.915}} = 1.089$$

as compared with the usual formula. This could be interpreted also as
meaning that the information contained in our 29 annual observations is
equivalent to only $29/(1.089)^2 = 24$ strictly independent or random
observations.
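
The arithmetic of Bartlett’s approximate correction can be retraced as follows. The function name is hypothetical; the two autocorrelation coefficients and the number of observations are those quoted in the text.

```python
import math

def bartlett_ratio(r1, r2):
    """Approximate ratio by which the standard error of the correlation
    between two series is increased when their lag-one autocorrelation
    coefficients are r1 and r2."""
    return math.sqrt((1 + r1 * r2) / (1 - r1 * r2))

ratio = bartlett_ratio(0.368, 0.231)   # temperature and rainfall series
equivalent_n = 29 / ratio ** 2         # 29 annual observations, 1928-1956
print(f"ratio = {ratio:.3f}")
print(f"equivalent independent observations = {equivalent_n:.1f}")
```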

Wold (1953) takes a similarly constructive approach with respect to
the meaning of autocorrelation in the residuals from time-series regression
equations.¹⁰ Assuming that the universe is essentially stable over a given
time period, and that the autocorrelation coefficient between successive
residuals is $\rho_1$, that between residuals lagged two time units is $\rho_2$, and so on,
the standard errors of the partial regression coefficients are increased
relative to the usual formula in the ratio $\sqrt{1 + 2\rho_1 + 2\rho_2 + \cdots}$. If we
assume that $\rho_2 = \rho_1^2$, $\rho_3 = \rho_1^3$, and so on, the terms beyond $\rho_3$ will normally
be quite small.

If we treat the temperature series (in which $\rho_1 = 0.368$) as if it were a
series of residuals, $\rho_2$ (on our assumptions) would equal 0.135 and $\rho_3$
would equal 0.050. If we consider only the term in $\rho_1$, the standard errors
are increased in the ratio 1.318; if we include $\rho_2$ as well they are increased
in the ratio 1.416. The 29 autocorrelated “residuals” would then give us
about the same level of accuracy in estimating regression coefficients as
would $29/(1.416)^2 = 14$ strictly random residuals.
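
The same figures can be reproduced with a short sketch of the ratio attributed to Wold, under the assumption made in the text that the coefficient at lag k equals the k-th power of the first coefficient. The function name and the `terms` parameter are hypothetical conveniences.

```python
import math

def wold_ratio(rho1, terms):
    """sqrt(1 + 2*rho_1 + 2*rho_2 + ...), assuming rho_k = rho1 ** k and
    keeping only the first `terms` autocorrelation coefficients."""
    total = 1 + 2 * sum(rho1 ** k for k in range(1, terms + 1))
    return math.sqrt(total)

RHO1 = 0.368                            # temperature series, treated as residuals
ratio_one_term = wold_ratio(RHO1, 1)    # term in rho_1 only
ratio_two_terms = wold_ratio(RHO1, 2)   # terms in rho_1 and rho_2
equivalent_n = 29 / ratio_two_terms ** 2

print(f"rho_2 = {RHO1 ** 2:.3f}, rho_3 = {RHO1 ** 3:.3f}")
print(f"ratio with rho_1 only:       {ratio_one_term:.3f}")
print(f"ratio with rho_1 and rho_2:  {ratio_two_terms:.3f}")
print(f"equivalent random residuals: {equivalent_n:.1f}")
```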

We turn now to some commonly used tests for the presence of significant
autocorrelation in residuals and to a method of dealing with autocorrelation
that is sometimes useful in economic time series, and might also
apply to some time series encountered in other disciplines.

Testing for Autocorrelation in the Residuals. In the early 1940’s,
two tests for autocorrelation were developed. These involve, respectively,
calculation of (1) a coefficient of autocorrelation, and (2) the ratio of the
“mean-square successive difference” to the variance. The second of these,
called von Neumann’s ratio, is based upon somewhat less restrictive

¹⁰ Wold, Herman, Demand Analysis, p. 211, Equation 7, John Wiley and Sons, New
York, 1953.




assumptions and since 1950 has been generally used in the analysis of
economic time series. It is applicable, of course, to time series arising in
other fields as well.¹¹

Calculation of the coefficient of autocorrelation may be illustrated
using residuals from the steel cost analysis (Table 16.3). The residual for
1921 is entered beside the residual for 1920, and so on, as shown in Table
20.2. The residual for 1920 is entered beside that of 1937, which is not

Table 20.2

Testing for Autocorrelation in Residuals from a Regression Line
or Surface: (1) Using the Coefficient of Autocorrelation*

          Residuals,   Lagged Residuals,
Year         z_t           z_{t+1}          z_t^2      z_t z_{t+1}

1920         0.9             4.4             0.81          3.96
1921         4.4            -7.0            19.36        -30.80
1922        -7.0             1.5            49.00        -10.50
1923         1.5            -1.8             2.25         -2.70
1924        -1.8             0.8             3.24         -1.44
1925         0.8             2.1             0.64          1.68
1926         2.1             1.3             4.41          2.73
1927         1.3            -0.5             1.69         -0.65
1928        -0.5            -1.7             0.25          0.85
1929        -1.7             1.0             2.89         -1.70
1930         1.0            -0.7             1.00         -0.70
1931        -0.7             5.9             0.49         -4.13
1932         5.9            -2.8            34.81        -16.52
1933        -2.8            -4.1             7.84         11.48
1934        -4.1            -1.1            16.81          4.51
1935        -1.1             1.3             1.21         -1.43
1936         1.3            -0.8             1.69         -1.04
1937        -0.8             0.9             0.64         -0.72

* The residuals are those from the steel-cost example, column (10) of Table 16.3.


illogical if we assume that all the residuals represent random drawings
from the same universe. If the residuals are derived from mathematically

¹¹ In fact, von Neumann’s ratio was originally developed in connection with ballistics
experiments in which nonrandom changes in wind velocity and other variables interfered
with attempts to measure the random error component in successive shots fired from a
weapon, i.e., the intrinsic level of accuracy of the weapon itself. The generality of the
method was immediately recognized by von Neumann and others.




fitted regressions, the means of $z_t$ and $z_{t+1}$ will be zero, and the coefficient
of autocorrelation is calculated as follows:¹²

$$r_a = \frac{\sum z_t z_{t+1}}{\sum z_t^2} = \frac{-47.12}{149.03} = -0.3162 \qquad (20.1)$$

Residuals from a graphic analysis may not sum exactly to zero; in such
cases the sums of squares and crossproducts are adjusted to a deviation-from-mean
basis by the usual formulas (8.1). In the present example,
$M_z = -0.0722$ (note that the means of $z_t$ and $z_{t+1}$ are identical), and the
corrected value of $r_a$ is

$$r_a = \frac{-47.214}{148.936} = -0.3170$$

This value is compared with tabulated values of the 5-per cent and
1-per cent probability levels of a theoretical distribution of autocorrelation
coefficients derived by random sampling from a universe having a true
autocorrelation of zero (Table 20.3).¹³ For a sample of 18 observations,
we estimate by interpolation that negative values of $r_a$ as large in absolute
value as 0.425 could occur by chance as often as 5 times in 100, so the
observed value, $r_a = -0.317$, does not indicate a significant degree of
autocorrelation. We assume, therefore, that the usual error formulas are
applicable to regression coefficients and forecasts derived from the steel
analysis if the underlying relationships among the variables continue as
they were in 1920-1937.¹⁴

A similar test applied to residuals from the linear-regression analysis
for corn yields (Table 14.1) gave an $r_a$ value of -0.104. The 5 per cent
probability level for samples of 38 observations is -0.287, so we conclude
that the usual error formulas apply to regression coefficients and forecasts
derived from the analysis of corn yields if the underlying relationships
continue as they were during 1890-1927.
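The two tests just described can be retraced numerically. The following is only a sketch: it starts from the sums quoted in the text rather than from the individual residuals, the function names `circular_autocorrelation` and `interpolate` are illustrative and not taken from the book, and straight-line interpolation between the tabulated sample sizes of Table 20.3 is assumed.

```python
def circular_autocorrelation(sum_cross, sum_sq, n, mean=0.0):
    """r_a = sum(z_t * z_{t+1}) / sum(z_t ** 2), with both sums
    corrected to a deviation-from-mean basis when the mean is not zero."""
    correction = n * mean ** 2
    return (sum_cross - correction) / (sum_sq - correction)


def interpolate(n, n_lo, value_lo, n_hi, value_hi):
    """Straight-line interpolation between two tabulated sample sizes."""
    weight = (n - n_lo) / (n_hi - n_lo)
    return value_lo + weight * (value_hi - value_lo)


# Steel example: sums quoted in the text, 18 residuals, mean -0.0722.
r_steel = circular_autocorrelation(-47.12, 149.03, n=18, mean=-0.0722)
# Negative-tail 5 per cent points of Table 20.3 for N = 15 and N = 20.
crit_steel = interpolate(18, 15, -0.462, 20, -0.399)

# Corn example: r_a as quoted; 5 per cent points for N = 35 and N = 40.
r_corn = -0.104
crit_corn = interpolate(38, 35, -0.300, 40, -0.279)

for name, r, crit in [("steel", r_steel, crit_steel), ("corn", r_corn, crit_corn)]:
    # Negative tail: significant only if r_a lies below the critical value.
    significant = r < crit
    print(f"{name}: r_a = {r:.3f}, 5% point = {crit:.3f}, significant = {significant}")
```

In both cases the observed coefficient lies inside the critical value, which agrees with the conclusion reached in the text.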

Another Test Is Given by von Neumann's Ratio. The calculation
of von Neumann's ratio for the steel example is shown in Table 20.4,


Since, in this case, the same items appear in series $z_t$ and $z_{t+1}$,
equation (20.1) is identical with equation (8.3) for the simple correlation
between $z_t$ and $z_{t+1}$.

Note that usage since 1950 applies the term autocorrelation coefficient to the case treated by
Anderson; serial correlation is usually defined as correlation between values of
one series and lagged values of another series.

Note that the formulas cited from Bartlett and Wold involve the universe values of the
autocorrelation coefficients. In the present paragraph our test of the sample coefficient
leads us to accept the hypothesis that autocorrelation in the universe is
zero; if so, Wold's adjustment factor reduces to 1.



Table 20.3

5 and 1 Per Cent Significance Points for the Coefficient of
Autocorrelation (Circular Definition)*

Sample           Positive Tail                   Negative Tail
Size N     5 Per Cent    1 Per Cent        5 Per Cent    1 Per Cent
             Level         Level             Level         Level

   5         0.253         0.297            -0.753        -0.798
   6         0.345         0.447            -0.708        -0.863
   7         0.370         0.510            -0.674        -0.799
   8         0.371         0.531            -0.625        -0.764
   9         0.366         0.533            -0.593        -0.737
  10         0.360         0.525            -0.564        -0.705
  11         0.353         0.515            -0.539        -0.679
  12         0.348         0.505            -0.516        -0.655
  13         0.341         0.495            -0.497        -0.634
  14         0.335         0.485            -0.479        -0.615
  15         0.328         0.475            -0.462        -0.597
  20         0.299         0.432            -0.399        -0.524
  25         0.276         0.398            -0.356        -0.473
  30         0.257         0.370            -0.324        -0.433
  35         0.242         0.347            -0.300        -0.401
  40         0.229         0.329            -0.279        -0.376
  45         0.218         0.313            -0.262        -0.356
  50         0.208         0.301            -0.248        -0.339
  55         0.199         0.289            -0.236        -0.324
  60         0.191         0.278            -0.225        -0.310
  65         0.184         0.268            -0.216        -0.298
  70         0.178         0.259            -0.207        -0.287
  75         0.174         0.250            -0.201        -0.276
  80         0.170         0.246            -0.195        -0.271
  85         0.165         0.239            -0.189        -0.263
  90         0.161         0.233            -0.184        -0.255
  95         0.157         0.227            -0.179        -0.248
 100         0.154         0.221            -0.174        -0.242

* Adapted, with the kind permission of the editor, from R. L. Anderson, Distribution
of the serial correlation coefficient, Annals of Mathematical Statistics, Vol. 13, No. 1,
pp. 1-13, 1942, with corrections, and recalculation of values for N = 80 to 100, suggested
by Anderson.

Autocorrelation is presumed to be present in the population if the computed value
of the coefficient of autocorrelation exceeds the value at the preselected significance
level for the particular sample size and at the appropriate tail of the distribution. Use
the positive tail for positive values of r, and the negative tail for negative values of r.





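graphic analyses) is usually trivial and will be disregarded here.
We have, in this example,

$$K = \frac{\delta^2}{s^2} = \frac{\sum (z_{t+1} - z_t)^2 / (n - 1)}{\sum z_t^2 / n} = \frac{389.41 / 17}{149.03 / 18} \approx 2.77 \qquad (20.2)$$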

From the table of von Neumann's ratio (Table 20.5), we find that values
larger than 2.895 could occur as frequently as 5 per cent of the time in
samples of 18 observations from a non-autocorrelated universe; we
therefore reject the hypothesis that the residuals from the steel regression
are significantly autocorrelated.

The corresponding test for residuals from the linear-regression analysis
of corn yields (Table 14.1) is as follows:

$$K = \frac{1058.02 / 37}{\sum z_t^2 / 38} = \frac{28.596}{12.615} \approx 2.27$$


Entering the table with a sample size of 38, we find that values of 2.589
or more could occur 5 per cent of the time in random drawings from a
non-autocorrelated universe. Thus, we conclude that there is no significant
autocorrelation in residuals from the corn analysis, and that the usual
standard error formulas apply to its regression coefficients and to individual
forecasts for observations drawn from the same universe as those of
1890-1927.
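The arithmetic of the von Neumann test for both examples can be checked in a few lines. This is a sketch under stated assumptions: the ratio is taken as the mean-square successive difference divided by the variance, with the correction for the mean disregarded; the names `von_neumann_ratio` and `verdict` are illustrative only; and the critical values are the 5 per cent points read from Table 20.5 for N = 18 and N = 38.

```python
def von_neumann_ratio(sum_sq_diff, sum_sq, n):
    """K = [sum((z_{t+1} - z_t) ** 2) / (n - 1)] / [sum(z_t ** 2) / n]."""
    delta_sq = sum_sq_diff / (n - 1)   # mean-square successive difference
    s_sq = sum_sq / n                  # variance, mean correction disregarded
    return delta_sq / s_sq


def verdict(k, lower, upper):
    """Compare K with the lower (K) and upper (K') critical values."""
    if k < lower:
        return "positive autocorrelation"
    if k > upper:
        return "negative autocorrelation"
    return "no evidence of autocorrelation"


# Steel example: sums quoted in the text, 18 residuals.
k_steel = von_neumann_ratio(389.41, 149.03, 18)
# Corn example: the two mean squares as quoted in the text, 38 residuals.
k_corn = 28.596 / 12.615

# 5 per cent points of Table 20.5: (K, K') for N = 18 and N = 38.
print(f"steel: K = {k_steel:.3f}, {verdict(k_steel, 1.3405, 2.8948)}")
print(f"corn:  K = {k_corn:.3f}, {verdict(k_corn, 1.5193, 2.5889)}")
```

Both ratios fall between the two critical values, so neither set of residuals gives evidence of autocorrelation at the 5 per cent level.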

Effects of Using First Differences Instead of Original Values. Regression
analysis, mathematical or graphic, can be applied just as readily
to series of first differences as to any other sets of numerical values. In
a time series of n original values, $X_t$, there will be n - 1 first differences,
$X_{t+1} - X_t$. If two or more economic time series are intercorrelated as
the result of trends which may not reflect logical or causal relations
between them, the use of first differences will typically reduce intercorrelation
and increase the probability that the regression coefficients obtained



Table 20.5

5 and 1 Per Cent Significance Points for the Ratio of the
Mean-Square Successive Difference to the Variance*

              Values of K                Values of K'
  N       P=0.01      P=0.05         P=0.05      P=0.01

  4       0.8341      1.0406         4.2927      4.4992
  5       0.6724      1.0255         3.9745      4.3276
  6       0.6738      1.0682         3.7318      4.1262
  7       0.7163      1.0919         3.5748      3.9504
  8       0.7575      1.1228         3.4486      3.8139
  9       0.7974      1.1524         3.3476      3.7025
 10       0.8353      1.1803         3.2642      3.6091
 11       0.8706      1.2062         3.1938      3.5294
 12       0.9033      1.2301         3.1335      3.4603
 13       0.9336      1.2521         3.0812      3.3996
 14       0.9618      1.2725         3.0352      3.3458
 15       0.9880      1.2914         2.9943      3.2977
 16       1.0124      1.3090         2.9577      3.2543
 17       1.0352      1.3253         2.9247      3.2148
 18       1.0566      1.3405         2.8948      3.1787
 19       1.0766      1.3547         2.8675      3.1456
 20       1.0954      1.3680         2.8425      3.1151
 21       1.1131      1.3805         2.8195      3.0869
 22       1.1298      1.3923         2.7982      3.0607
 23       1.1456      1.4035         2.7784      3.0362
 24       1.1606      1.4141         2.7599      3.0133
 25       1.1748      1.4241         2.7426      2.9919
 26       1.1883      1.4336         2.7264      2.9718
 27       1.2012      1.4426         2.7112      2.9528
 28       1.2135      1.4512         2.6969      2.9348
 29       1.2252      1.4594         2.6834      2.9177
 30       1.2363      1.4672         2.6707      2.9016
 31       1.2469      1.4746         2.6587      2.8864
 32       1.2570      1.4817         2.6473      2.8720
 33       1.2667      1.4885         2.6365      2.8583
 34       1.2761      1.4951         2.6262      2.8451
 35       1.2852      1.5014         2.6163      2.8324
 36       1.2940      1.5075         2.6068
 37       1.3025      1.5135         2.5977      2.8085
 38       1.3108      1.5193         2.5889      2.7973
 39       1.3188      1.5249         2.5804      2.7865
 40       1.3266      1.5304         2.5722      2.7760
 41       1.3342      1.5357         2.5643      2.7658
 42       1.3415      1.5408         2.5567      2.7560
 43       1.3486      1.5458         2.5494      2.7466
 44       1.3554      1.5506         2.5424      2.7376
 45       1.3620      1.5552         2.5357      2.7289
 46       1.3684      1.5596         2.5293      2.7205
 47       1.3745      1.5638         2.5232      2.7125
 48       1.3802      1.5678         2.5173      2.7049
 49       1.3856      1.5716         2.5117      2.6977
 50       1.3907      1.5752         2.5064      2.6908
 51       1.3957      1.5787         2.5013      2.6842
 52       1.4007      1.5822         2.4963      2.6777
 53       1.4057      1.5856         2.4914      2.6712
 54       1.4107      1.5890         2.4866      2.6648
 55       1.4156      1.5923         2.4819      2.6585
 56       1.4203      1.5955         2.4773      2.6524
 57       1.4249      1.5987         2.4728      2.6465
 58       1.4294      1.6019         2.4684      2.6407
 59       1.4339      1.6051         2.4640      2.6350
 60       1.4384      1.6082         2.4596      2.6294

* Adapted, with the kind permission of the editor, from B. I. Hart, Significance levels
for the ratio of the mean square successive difference to the variance, Annals of
Mathematical Statistics, Vol. 13, No. 4, p. 446, 1942.

For the given level of significance and the appropriate sample size (N), a computed K
is indicative of positive autocorrelation if it falls below the critical value of K, and is
indicative of negative autocorrelation if it exceeds the corresponding critical value of
K'; if it falls between the two critical values, no evidence of autocorrelation is present.


will represent meaningful relationships. If there is positive autocorrelation
in the residuals from an analysis based on original values, residuals
from the corresponding first-difference analysis typically show lower
and/or non-significant autocorrelation.

Slutsky and others have observed that the cumulative sums of a series of
random numbers will trace out cyclical, or positively autocorrelated,
patterns. For example, suppose we have a random sequence composed of
the values 1 and -1. The cumulative sum of these values will fluctuate
around zero. But if it reaches a value of 1 there is a 50 per cent chance
that the next observation will raise it to 2. If the sum attains the value of 2,
the next observation will change it to 3 or 1, both of which contribute to
positive autocorrelation. If the cumulative sum were -2 at the end of n
observations, the next observation would change it to -3 or -1, again
contributing to positive autocorrelation.

By analogy, the residuals $z_{t+1}$, $z_{t+2}$, ... may be thought of as a cumulative
sum of first differences ($z_{t+1} - z_t$, $z_{t+2} - z_{t+1}$, etc.) added to the first
residual $z_t$. If the actual residuals are autocorrelated, their first differences
may still be random. This is not inevitably true, but it is frequently found
to be true in practice.
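Slutsky's observation is easy to reproduce by simulation. The sketch below is illustrative only: the seed, the series length of 1,000, and the helper name `lag1_autocorrelation` are assumptions, and the coefficient is computed in the ordinary (non-circular) way from deviations about the mean.

```python
import random


def lag1_autocorrelation(values):
    """Lag-one autocorrelation of a series, measured about its own mean."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    cross = sum(dev[i] * dev[i + 1] for i in range(n - 1))
    return cross / sum(d * d for d in dev)


rng = random.Random(0)                       # fixed seed so the run is repeatable
steps = [rng.choice((1, -1)) for _ in range(1000)]

# Cumulative sum of the random +1 / -1 values.
cumulative = []
total = 0
for step in steps:
    total += step
    cumulative.append(total)

# First differences of the cumulative sum recover the random steps.
differences = [cumulative[i + 1] - cumulative[i] for i in range(len(cumulative) - 1)]

print("random steps:     ", round(lag1_autocorrelation(steps), 3))
print("cumulative sum:   ", round(lag1_autocorrelation(cumulative), 3))
print("first differences:", round(lag1_autocorrelation(differences), 3))
```

The cumulative sum shows strong positive autocorrelation even though every step is random, while its first differences are the original random steps and show essentially none.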

Some examples reported by Cochrane and Orcutt, based upon the work
of Richard Stone, are shown in Table 20.6.

Referring to Table 20.5, we find that the 1-per cent and 5-per cent points
for testing positive autocorrelation in samples of 19 observations are
1.0766 and 1.3547, i.e., values lower than these are indicative of positive
autocorrelation. Of the 13 analyses based on original values, 10 showed
significant autocorrelation at the 5-per cent level; 3 of these were
significant at the 1-per cent level. It is noteworthy that the inclusion of a
linear time trend in the analyses based on original data did not eliminate
autocorrelation; 6 of the 8 analyses that included a trend factor showed
significant autocorrelation.

The results for the 11 first-difference analyses are strikingly different.
Not one of them shows significant autocorrelation; one veers toward
negative autocorrelation, but not significantly so.

Although the above results based on original data may not be typical
of those which might be obtained in other time-series studies, the effect
of first-difference transformations when the residuals are positively
autocorrelated will quite generally be in the direction shown.

The common sense meaning of a regression between first differences is as follows:
If (say) high original values of price are associated with low original values of
consumption and vice versa, it follows logically that a change from a high price to a low
price will be associated with a change from a low-consumption to a high-consumption
figure. In strictly random samples linear regression coefficients estimated from first
differences will closely approximate those obtained from the original values.



Table 20.6

Values of von Neumann's Ratio for a Number of Demand Studies for
the United Kingdom, 1920-1938

                                              von Neumann's Ratio
Commodity                  Number of        Original       First
                           Parameters       Data           Differences

Beer                       3                1.28           1.86
                           4                1.13           2.01
                           4 + time         1.23
Spirits                    3 + time         1.26           2.63
Telegrams                  3                1.24           1.61
                           4 + time         1.10           1.65
Imported wine              4                1.49           1.84
Communication services     3 + time         0.71           2.05
                           4 + time         0.70           2.11
Lard                       3 + time         0.90           2.06
Margarine                  4                1.26           1.80
                           4 + time         2.02
                           5 + time         2.31           2.31

Mean value                                  1.28           1.99

Source: D. Cochrane and G. H. Orcutt, Application of least-squares regression to
relationships containing autocorrelated error terms, Journal of the American Statistical
Association, Vol. 44, No. 245, pp. 32-61, March, 1949.
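The counts quoted in the text can be verified directly from the table. In the sketch below the ratios are entered as read from Table 20.6, the critical values are those of Table 20.5 for N = 19, and the variable names are illustrative only.

```python
# von Neumann ratios as read from Table 20.6.
original = [1.28, 1.13, 1.23, 1.26, 1.24, 1.10, 1.49,
            0.71, 0.70, 0.90, 1.26, 2.02, 2.31]
first_differences = [1.86, 2.01, 2.63, 1.61, 1.65, 1.84,
                     2.05, 2.11, 2.06, 1.80, 2.31]

# Critical values from Table 20.5 for samples of 19 observations.
K_1_PER_CENT, K_5_PER_CENT = 1.0766, 1.3547   # lower points (positive autocorrelation)
K_PRIME_5_PER_CENT = 2.8675                   # upper point (negative autocorrelation)

positive_5 = sum(1 for k in original if k < K_5_PER_CENT)
positive_1 = sum(1 for k in original if k < K_1_PER_CENT)
diff_positive = sum(1 for k in first_differences if k < K_5_PER_CENT)
diff_negative = sum(1 for k in first_differences if k > K_PRIME_5_PER_CENT)

mean_original = round(sum(original) / len(original), 2)
mean_differences = round(sum(first_differences) / len(first_differences), 2)

print("original data: significant at 5 per cent:", positive_5)
print("original data: significant at 1 per cent:", positive_1)
print("first differences: significant (either tail):", diff_positive + diff_negative)
print("means:", mean_original, mean_differences)
```

The tallies agree with the statements in the text: 10 of the 13 original-data analyses are significant at the 5 per cent level, 3 at the 1 per cent level, and none of the 11 first-difference analyses is significant.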


Testing for Changes in Regression Relationships over Time. Some
general considerations and tests based upon the randomness and size of
residuals have already been outlined. Where a regression surface has
been fitted mathematically, suspected changes in the true relationship
may be tested by adding a "discontinuity variable" which takes the
value 0 for all years prior to a given date and the value 1 for all subsequent
years. This variable will have a mean and a standard deviation both
between 0 and 1; the usual tests of significance can be applied to the net
regression of the dependent variable upon the "discontinuity variable."
A significant net regression coefficient implies a significant change in the
relationship from one period to the other. As the same net regression
coefficients between the dependent variable and the "real" independent
variables are applied to the estimates for both periods, a significant
discontinuity coefficient implies that the residuals before and after the
given date probably represent drawings from different universes.

For example, we might be uncertain as to whether the observed values
of certain economic variables after World War II may be regarded as
coming from the same statistical universe as those for 1922-1941. A
significant discontinuity coefficient would lead us to answer this question
in the negative. We might then proceed to fit separate regression equations
to each of the two periods and compare the results. In some cases it may
be only the constant term, a, that is significantly different in the two
periods; if so, data for the two periods may be combined for the purpose
of estimating the b values, but with the discontinuity variable added to
reflect the shift in the value of the constant term.
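The mechanics of the discontinuity variable can be sketched with constructed data. Everything in the example below is hypothetical: the years, the independent variable, and the shift of 3.0 in the constant term are invented for illustration, and the helpers `solve` and `fit` are a bare-bones least-squares routine, not a method taken from this book. A real application would also compute the standard error of the discontinuity coefficient before judging its significance.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit(rows, y):
    """Least-squares coefficients from the normal equations X'X b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical pre-war and post-war years.
years = list(range(1922, 1942)) + list(range(1947, 1957))
x = [float(i % 7) for i in range(len(years))]          # hypothetical independent variable
d = [0.0 if year <= 1941 else 1.0 for year in years]   # discontinuity variable
# Constructed dependent variable: same slope throughout, constant term shifted by 3.0.
y = [2.0 + 0.5 * xi + 3.0 * di for xi, di in zip(x, d)]

a, b, shift = fit([[1.0, xi, di] for xi, di in zip(x, d)], y)
print(f"a = {a:.3f}, b = {b:.3f}, discontinuity coefficient = {shift:.3f}")
```

Because the data were built without error, the fit returns the constant, the slope, and the shift exactly; with real observations the shift coefficient would be tested against its standard error in the usual way.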

Further Comments on "Practical" Forecasting in Time Series. In
Chapter 19 we described the error formulas applicable to forecasts of
individual values of a dependent variable for new observations not
included in the sample on which the regression equation was based. The
error zone for such forecasts may prove to be fairly wide, particularly
for unusual combinations of values of the independent variables, and
in no event can it be smaller than the standard error of estimate. In
random sampling from a stable universe, 5 observations in 100, on the
average, will depart from the regression estimate by more than two
standard errors of forecast.

The same probability statements apply to forecasts from regression
equations based on time series if the relevant universe remains stable over
time. In deciding this, particularly for cases in which the observed time
series are short, the statistician must draw on the theory of his field and
his experience with other time-series analyses of similar variables, as well
as upon the evidence in the particular set of observations at hand. If
large numbers of observations are available, as with certain types of
meteorological data, the stability of the universe can be investigated by
fitting the same regression function to the data for each of several
non-overlapping periods.

Some economists have found diagrams such as Figure 20.1 helpful
in considering the ways in which a universe extended over time may be
changing, and in appraising which of its parameters may have been most
seriously altered. This particular diagram refers to a universe in which
the observations are annual totals or averages for the entire United States.
If we were interested in a universe of weekly observations of retail beef
prices and quantities purchased by consumers in a particular town, our
diagram would include a somewhat different set of factors and some of
the arrows representing directions of influence might be reversed.



Fig. 20.1. The demand and supply structure for beef. Arrows show direction of
influence. Heavy arrows indicate major paths of influence, which account for the
bulk of the variation in current prices. Light solid arrows indicate definite but less
important paths; dashed arrows indicate paths of negligible, doubtful, or
occasional importance. (Karl Fox, Analysis of demand for farm products, United
States Dept. of Agriculture, Technical Bulletin 1081, p. 34, Sept., 1953.)

Suppose that we had determined the simple regression of "farm price
of slaughter cattle" upon the "number of cattle and calves on farms,
January 1" for some period such as 1922-1941, and that the relationship
was close enough to have considerable forecasting value during those
years. Does it appear likely that this same regression line will still apply
in the 1950's?

At any given time, each arrow in Figure 20.1 stands for a regression
coefficient (or line or curve) connecting two variables. If we follow only
the heavy arrows, the diagram implies that there are five major links
in the chain connecting the two variables in which we are interested.
Stability of the regression between "farm price" and "January 1 numbers"
over the years implies at the least (1) that the coefficients attaching to
each link have remained nearly constant, or (2) that changes in two or
more coefficients have approximately offset one another. Changes in
the links connecting disposable consumer income and the supply of other
meats with the retail price of beef could also affect the fluctuations of
farm price relative to variations in January 1 numbers. If one were
considering the lagged effect of price on subsequent supply, other arrows
would be needed and other relationships besides those shown would have
to be examined.

Some of the individual links could be checked by simple inspection
of time series of the two variables involved, or by comparing a freehand
regression based on a few recent years with a similar regression for the
1922-1941 period. Thus, the link between "number of cattle slaughtered"
and "beef production" involves only a question of fact as to whether the
average carcass weight of cattle slaughtered today is about the same as
it was during 1922-1941.

If we were interested in the stability from 1922-1941 to date of a
universe consisting only of the one link just described, it should be quite
easy to decide whether or not this universe had changed. The stability
of a universe involving "beef consumption" and the "retail price of beef"
might have to be tested simultaneously with the stability of universes
involving (1) the retail price of beef and disposable consumer income,
and (2) the retail price of beef and the supply of other meats and poultry.

Figure 20.1 is a logical device rather than a statistical one. In this
context it serves mainly as a check list of ways in which a time-series
universe could change. It is almost certain that some elements in a universe
as complex as the whole of Figure 20.1 will change over a twenty-year
period. But changes in a few elements may leave regression relationships
in many of the "subuniverses" of Figure 20.1 substantially unaltered.

Summary 

In this chapter we have discussed the applicability of standard-error
formulas to regression analyses based on time series; we have shown
how these can still be applied in many cases, and presented some statistical
tests for determining whether or not a particular set of time-series
observations meets the requirements for such application. We also pointed out
that knowledge of the theory of the subject matter in question and
experience with other time-series analyses of similar variables are important
in deciding whether a particular set of time-series observations can be
regarded as a sample from a stable universe. Our position is that the
special difficulties and uncertainties associated with time-series analysis
have frequently been exaggerated; that the strictly statistical problems
of time series, particularly those of autocorrelation, can be dealt with
scientifically; and that the results of sample surveys or controlled
experiments may also get out of date with the passage of time.

REFERENCES 

Yule, G. U., Why do we sometimes get nonsense-correlations between time-series? A
study in sampling and the nature of time series, Jour. Roy. Stat. Soc., Vol. 89,
No. 1, pp. 1-64, 1926.

Koopmans, Tjalling C., Linear Regression Analysis of Economic Time Series, Haarlem,
De erven F. Bohn, n.v., 150 pp., 1937.

Bartlett, M. S., Some aspects of the time correlation problem in regard to tests of
significance, Jour. Roy. Stat. Soc., Vol. XCVIII, pp. 536-543, 1935.

Wold, Herman, in association with Lars Jureen, Demand Analysis, A Study in
Econometrics, John Wiley and Sons, New York, 1953. Especially Chapter 2, pp. 43-45,
and Chapter 13, pp. 209-213.

Anderson, R. L., The problem of autocorrelation in regression analysis, Jour. Amer.
Stat. Assoc., Vol. 49, No. 265, p. 113, March, 1954.

Durbin, J., and G. S. Watson, Testing for serial correlation in least squares regression,
Biometrika, Vol. 38, Nos. 1-2, pp. 159-177, 1951.



SECTION VI


Miscellaneous 
Special Regression Methods 


CHAPTER 21 

Measuring the relation between 
one variable and two or more 
others operating jointly 


In working out the change in one variable with changes in other
variables up to this point we have assumed that the relation of the
dependent factor to each independent factor did not change, no matter
what combination of other independent factors was present. In the
case of the yield of corn, for example, as worked out in Chapter 14, we
assumed that the effect of a given change in rainfall upon the yield was
the same, no matter what was the temperature for the season. The
significance of this assumption may be shown by combining the estimate
for rainfall with the estimate for temperature, and plotting the combined
influence of the two variables. In Table 14.19 we already have this
combined influence worked out, so all we have to do is to plot it. Figure
21.1 shows the resulting figure. In this figure inches of rainfall are read
along the right-hand edge of the bottom of the cube, degrees of temperature
along the left-hand edge, and the yield along the vertical edge. The yield
for any combination of temperature and rainfall is shown by the distance
the upper surface of the solid figure is above the point of intersection of
the corresponding values in the base plane.*

* The way this figure is made may be thought of as follows: Suppose we drew a
series of charts of the estimated differences in yield with differences in rainfall, with
one chart for an average temperature of 70°, one for 72°, one for 74°, etc. Then if
we cut these charts off at the yield line, and arrange them one back of the other, at
even distances, we have a figure looking much like Figure 21.1. The lines sloping



Inspecting Figure 21.1, we can now see what is meant by saying that
the changes in yield are assumed to be the same for each change in rainfall,
no matter what the temperature. As shown in the figure, the maximum
yield with a temperature of 70 degrees is obtained at about 12 inches of
rain, and that is also the rainfall which produces a maximum yield with
a temperature of 72, 74, or 78 degrees. Each curve has the same shape,
and the only difference is their elevation above the base. On looking



Fig. 21.1. Probable yield of corn for various specific combinations of rainfall
and temperature, from multiple curvilinear regression.


at it the other way, we find that the same is true of temperature. With
9 inches of rainfall the maximum yield is obtained at about 75 degrees,
and the maximum is also at 75 degrees with other levels of rainfall. This
relation necessarily follows the assumptions made in measuring it. Figure
21.1 merely shows the estimate we get by the use of equation (14.1):

$$X_1 = a + f_2(X_2) + f_3(X_3)$$

In working out these estimates we simply add together the estimated
value for $X_2$ and the estimated value for $X_3$. It does not make any
difference what the value of $X_3$ is, the changes in $X_1$ assumed to accompany


across the surface from left to right represent what would be the tops of this series
of charts. (In this figure the estimates are charted for all combinations of the two
variables, even for some not represented in the sample and not shown in Table


particular changes in $X_2$ are the same, and that is what the figure shows.

Only a little reflection is needed to indicate that Figure 21.1 may not
tell the whole truth of the relation of yield to rainfall and temperature.
It is quite possible that the crop can use more rain in a hot season than in
a cool one, so that the rainfall which will produce the maximum crop
may be higher in a season of high average temperature than in a season
of low temperature. If that is really the case, equation (14.1) is unable to
express the relationship, for that equation assumes that the change in
yield with rainfall is the same, no matter what the temperature.

Fig. 21.2. Differences in mortality with differences in
weight for men of various ages (each in percentage of
average mortality for that age). Illustration taken from
an article by Andrew Court.

An extreme illustration of a changing relationship is shown in Figure
21.2. This figure, which is based on actuarial investigations,* shows the
differences in mortality among men from the usual rate, for differences
in weight at different ages. Taking the 22-year line, for example, we
see that men who are much over normal weight have a much higher
mortality than normal for that age. Then as the weight is less the mortality
is less, until at normal weight there is only normal mortality. But as the
weight drops still more, the mortality increases again, until at 70
per cent of the normal weight the mortality is more than 20 per cent in
excess of normal.

* Medico-Actuarial Investigations, Vol. II, p. 24, 1913.

The relation is different for 52-year-old men, however. For them the
mortality is also higher for those above normal weight and decreases as
normal weight is reached. But as the weight falls below normal the
mortality continues to decrease, until for men who are only 70 per cent of
normal weight, the mortality is more than 15 per cent below the normal
for that age. For ages intermediate between these two, the change is
also intermediate: as is shown in the chart, 27 years is similar to 22,
but not so marked, and the line for 47 years is similar to that for 52. At
42 years, there is apparently little difference in mortality anywhere between
70 per cent of normal weight and 100 per cent, but there is an even more
marked rise in mortality with overweight than at higher or lower ages.

Fig. 21.3. Relation shown in previous figure, represented by equation

The previous methods of analysis would be quite incapable of dealing
adequately with the relationship shown in Figure 21.2. Were equation
(14.1) used to represent this relation, the higher mortality with lower
weights for young men would tend to balance out the lower mortality
for the older men at the same weight. In fact, the erroneous conclusion
might be reached that the age does not affect the relation between weight
and mortality. Figure 21.3 shows the results of an attempt to represent
this relation by the methods previously discussed. It is quite obvious that
this representation leaves out many of the important facts.

Use of "Joint Functions" to Show Combined Effects. What is needed
in both the corn-yield problem and the mortality problem is some way
of determining what the yield in the one case, or the mortality in the
other, is most likely to be for any given combination of the two independent
variables. That is quite different from asking for the separate effect of
each one. Obviously, a small change in one independent factor will be
expected to be accompanied by only a small change in the dependent,
so that all the estimated yields (or mortalities) will be expected to lie along
a continuous surface; but the surface will be free to warp or change its
shape in different portions like the surface shown in Figure 21.2, instead
of being held rigidly to the same shape in each dimension, like the surfaces
in Figures 21.1 or 21.3. Mathematically, such a changing relation between
one variable and two or more others is known as a joint functional relation,
and may be indicated by the equation

$$X_1 = f(X_2, X_3) \qquad (21.1)$$

This is read simply that "$X_1$ is a joint function of $X_2$ and $X_3$." That
means only that, for any combination of values of $X_2$ and $X_3$, there will
be some particular value of $X_1$. Equation (21.1) is therefore capable of
representing either a relation such as that shown in Figure 21.1, or the
more complex relation shown in Figure 21.2.

The problem of determining the extent to which corn yield varies with
the joint effect of temperature and rainfall may be said to be one of
determining the functional relation of yield to the two other factors,
according to the relation shown in equation (21.1).
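The difference between the additive form of equation (14.1) and a joint function can be made concrete with two small surfaces. Both functions below are hypothetical and are not fitted to the corn data; they are chosen only so that the additive surface peaks near 12 inches of rain and 75 degrees, as in Figure 21.1, while in the joint surface the best rainfall is assumed to rise with temperature.

```python
def additive_yield(rain, temp):
    """X1 = a + f2(X2) + f3(X3): the rainfall effect is the same at every temperature."""
    f2 = -0.5 * (rain - 12.0) ** 2      # hypothetical rainfall curve
    f3 = -0.2 * (temp - 75.0) ** 2      # hypothetical temperature curve
    return 40.0 + f2 + f3


def joint_yield(rain, temp):
    """X1 = f(X2, X3): the best rainfall is assumed to rise in hotter seasons."""
    best_rain = 12.0 + 0.5 * (temp - 75.0)
    return 40.0 - 0.5 * (rain - best_rain) ** 2 - 0.2 * (temp - 75.0) ** 2


rains = [6.0 + 0.5 * i for i in range(25)]   # 6 to 18 inches in half-inch steps
temps = (70, 74, 78)


def best_rainfall(surface, temp):
    """Rainfall on the grid giving the highest yield at a given temperature."""
    return max(rains, key=lambda rain: surface(rain, temp))


additive_best = [best_rainfall(additive_yield, t) for t in temps]
joint_best = [best_rainfall(joint_yield, t) for t in temps]

print("additive surface, best rainfall at 70, 74, 78 degrees:", additive_best)
print("joint surface,    best rainfall at 70, 74, 78 degrees:", joint_best)
```

On the additive surface the rainfall giving the maximum yield is the same at every temperature; on the joint surface it shifts as the temperature changes, which is exactly the kind of warping that equation (14.1) cannot express.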


Determining a Joint Function for Two Independent Variables 

Where only two independent variables are concerned, the joint functional
relation may be determined quite simply, if a large enough number of
observations is available.

The process may be illustrated by data from a different problem, shown
in Table 21.1. The observations are from a field study of haystack
dimensions in the Great Plains area, made in the late 1920's. The very rapid
introduction of pickup hay-baling machines since 1945 has largely
eliminated the former practice of selling hay loose in the stack. (The
number of pickup hay balers in the United States increased elevenfold



For Joint Regression Surfaces 


3S3 


Table 21.1 


Data Taken from NenRASKA Round Stacks MrAstKin in 1927 avo 5«2'* 


Volume 

(cubic 

feet) 

Circum- 

ference 

(feet) 

“Oscr" 

(feet) 

A',t 

A',f 

AV 

a; 

r 

2853.00 

69.0 

37.00 

0.139 

0.I6S 

0.455 

0 4S5 

— n pro 

2702.00 

65.0 

36.50 

0.113 

0.162 

0.432 

0.445 

-0 013 

3099.00 

73.0 

38.50 

0.163 

0185 

0.491 

0 545 

-0 054 

1306.00 

62.5 

26.50 

0.096 

0.023 

0.116 

:0.)25 

-0fi<>9 

2294.00 

70.0 

35.00 

0.145 

0.144 

0.361 

0 440 

-0 079 

2725.00 

68.0 

36.50 

0.133 

0.162 

0 435 

0 465 

-OO.tO 

3309.00 

71.0 

39.25 

O.ISI 

0.194 

0.520 

0.550 

-0 030 

2790.00 

64.0 

36.75 

0.106 

0.165 

0.446 

0.440 

OfKW 

2756.00 

62.0 

38.50 

0.092 

0.1 S5 

0 440 

0.455 

-0 015 

5237.92 

80.0 

43.00 

0.203 

0.233 

0.719 

0.700 

0 019 

3149.82 

67.0 

37.60 

0.126 

0.175 

0.49S 

0 480 

0.018 

5498.46 

79.0 

44.60 

0.198 

0.249 

0.740 

0.745 

-0 P'35 

3397.83 

66.0 

38.00 

0.120 

0.180 

0.531 

0 480 

0 051 

3007.56 

62.0 

36.80 

0.092 

0.165 

0.478 

0.445 

0 033 

4574.29 

79.0 

41.10 

0.193 

0.214 

0.660 

0 650 

0010 

6228.59 

73.0 

48.00 

0.163 

0.2S1 

0.794 

20 815 

-0 021 

2318.64 

63.0 

30.20 

0.099 

O.OSO 

0 365 

0 270 

0 095 

3176.71 

68.0 

37.75 

0.133 

0.177 

0 502 

0 490 

0.012 

2352.31 

70.0 

32.50 

0.145 

0.112 

0 371 

0.375 

-ooai 

2174.44 

69.0 

31.62 

0.139 

0.100 

0.337 

0.340 

-0003 

2694.72 

73.0 

34.50 

0.163 

0.138 

0.431 

0.440 

-0 009 

3333.53 

70,0 

37.25 

0.145 

0.171 

0 523 

0.495 

0.028 

4328.92 

78.5 

40.00 

0.195 

0.202 

0 636 

0.610 

0026 

2115.04 

67.0 

31.25 

0.126 

0.095 

0 325 

0.320 

OOOS 

2489.08 

66.5 

33.75 

0.123 

0.128 

0.396 

0.390 

0 006 

2296.65 

64.5 

32.38 

O.IIO 

0.110 

0.361 

0.340 

0 021 

3117.21 

65.5 

37.58 

0.116 

0.175 

0 494 

0.470 

0 024 

4088.36 

74.0 

40.33 

0.169 

0.206 

0612 

0.600 

0012 

4180.88 

72.0 

40.50 

0.157 

0.207 

0.621 

0.590 

0.031 

2318.19 

63.0 

33.00 

0.099 

0.119 

0.365 

0.345 

0.020 

1946.90 

58.0 

31.00 

0.063 

0.091 

0 289 

0 265 

0.024 

2479.89 

61.0 

36.50 

0.0S6 

0.162 

0.394 

0.405 

-0.01 1 

3174.80 

73.0 

37.00 

0.163 

0.1 6S 

0.502 

0.505 

-0003 

2151.54 

64.0 

33.00 

0.106 

0.119 

0.333 

0 355 

-0 022 

3475.68 

73.0 

39.50 

0.163 

0.197 

0.541 

0.575 

-0 034 

4393.08 

71.0 

42.00 

0.151 

0.223 

0.643 

0.620 

0 014 

2819.50 

69.0 

35.00 

0.139 

0.144 

0.450 

0.435 

0015 

3703.49 

70.0 

38.50 

0.145 

0.185 

0.569 

0 525 

0.044 


* Acknowledgment is due W. H. Hosterman, of the Bureau of Agricultural Economics, U. S. Department of Agriculture, for the use of these data.
† $X_2 = \log_{10}(\text{circumference}) - 1.7$, stated to three decimal places.
$X_3 = \log_{10}(\text{"over"}) - 1.4$, stated to three decimal places.
$X_1 = \log_{10}(\text{volume}) - 3.0$, stated to three decimal places.
‡ Estimated by extrapolation of the surface shown on Fig. 21.4.




Table 21.1 (Cont/nued) 


Volume (cubic feet)    Circumference (feet)    "Over" (feet)    $X_2$†    $X_3$†    $X_1$†    $X_1'$    $z$


2742 81 72 5 

3002 40 66 0 

1854 19 69 0 

1982 07 62 0 

2470 86 65 0 

1203 15 60 1 

2843 84 71 0 

2636 25 66 0 

1998 39 65 0 

2005 03 64 0 

2568 76 66 0 

2161 18 65 0 

2112 20 67 0 

3009 33 65 0 

1992 24 63 0 

2746 98 70 0 

2238 27 64 0 

1747 47 67 0 

2863 91 67 0 

3593 47 72 0 

2435 48 62 0 

2430 18 63 0 

2590 07 67 0 

3577 68 700 

3299 24 73 0 

1986 14 64 0 

310904 680 

2821 56 71 0 

2932 24 67 0 

3304 63 69 0 

2565 46 72 0 

4509 93 74 0 

4804 01 81 0 

4241 80 75 0 

4516 10 69 2 

5011 62 77 5 

2110 73 65 0 

2775 70 76 0 

3927 90 72 0 

4212 77 800 

3562 64 78 5 


34 SO 0 160 

35 50 0120 

30 50 0 139 

31 00 0 092 

33 50 0113 

26 25 0 079 

36 00 0151 

36 00 0 120 

32 00 0113 

32 00 0 106 

35 00 0120 

32 50 0113 

32 00 0126 

3800 0113 

31 00 0099 

34 00 0145 

35 00 0 106 

30 00 0 126 

3600 0126 

39 00 0 157 

3500 0 092 

34 00 0 099 

35 00 0 126 

4100 0I4S 

40 00 0163 

32 SO 0 106 

38 00 0 133 

37 00 0151 

38 00 0 126 

38 00 0 139 

35 00 0157 

41 33 0 169 

42 00 0 208 

4073 0373 

43 25 0 140 

43 10 0 189 

31 50 0113 

3460 0181 

39 00 0157 

41 SO 0 203 

38 50 0195 


0 136 0 438 

0 150 0 477 

0 084 0 268 

0 091 0 297 

0 125 0 393 

0 019 0 080 

0156 0 454 

0156 0 421 

0 105 0 301 

0 105 0 302 

0144 0 410 

0112 0 335 

0 105 0 325 

0180 0 478 

0 091 0 299 

0 131 0 439 

0 144 0 350 

0 077 0 242 

0156 0 457 

0191 0 555 

0144 0 387 

0131 0 386 

0 144 0 413 

0213 0 554 

0 202 0 518 

0112 0 298 

0180 0 493 

0 168 0 450 

0180 0 467 

0180 0 519 

0144 0 409 

0 216 0 654 

0 223 0 682 

Si&’J 
0 236 0 655 

0 234 0 700 

0 098 0 324 

0 139 0 443 

0 191 0 594 

0 218 0 624 

0 185 0 552 


0 440 -0 002 

0 430 0 047 

0 295 -0027 

0 295 0 002 

0 375 OOJ8 

;0 120 -0 040 

0 465 -0 011 

0 440 -0 019 

0 335 -0 034 

0 330 -0 028 

0 420 -0 010 

0 345 -0 010 

0 345 -0 020 

0 475 0 003 

0 295 0 004 

0415 0 024 

0 400 -0050 

0 275 -0033 

0450 0007 

0 550 0 005 

0 385 0002 

0 370 0016 

0425 -0 012 

0 585 -0031 

0 585 -0 067 

0310 -0012 

0 500 -0 007 

0 495 -0 045 

0 490 -0 023 

0 505 0014 

0 445 - 0 036 

0 625 0 029 

0 680 0002 

£l£20 Om7 

0 630 0 025 

0 695 0 005 

0 320 0 004 

0 450 -0 007 

0 550 0 044 

0 665 -0 041 

0 575 -0 023 


See footnotes on first page of table 





Table 21.1 (Continued) 


Volume 

(cubic 

feet) 

Circum- 

ference 

(feet) 

“Over" 

(feet) 

$X_2$†

2S53.96 

75.0 

35.50 

0.175 

3294.38 

69.0 

38.00 

0.139 

16S9.54 

63.0 

30.50 

0.099 

2228.84 

62 0 

33.00 

0 092 

2362.61 

640 

34.00 

0.106 

30S8.28 

68.0 

38.50 

0.133 

3820.79 

70.0 

40.00 

0.145 

3I26.&4 

63.0 

36.90 

0 099 

3624.75 

71.0 

38.45 

0.151 

3023.97 

73.0 

36.50 

0 163 

6045.42 

79.0 

47.00 

0.I9S 

3100.11 

640 

37.00 

0 106 

3378.07 

70.0 

38.00 

0.145 

3040.29 

77.0 

35.00 

0.186 

2252.16 

65.0 

3150 

0.113 

3552.61 

76.0 

37.00 

0.181 

2635.90 

66.0 

34.50 

0.120 

3201.41 

71.0 

35.50 

0.151 

2590.21 

69.0 

35.00 

0.139 

3743.55 

76.0 

38.25 

0.181 

3858.03 

73.0 

39.50 

0.163 

3829.44 

74.0 

39.75 

0 169 

2556.44 

66.0 

33.00 

0.120 

3119.07 

69.0 

36.00 

0 139 

2122.38 

65.5 

3100 

0.116 

2921.92 

69.0 

36.00 

0.139 

2936.35 

72.5 

34.50 

0 160 

2427.66 

76.0 

33.00 

0.181 

2069.38 

65.0 

31.50 

0.113 

1899.54 

72.0 

30 00 

0 157 

4289.28 

78.5 

40.50 

0.195 

2407.39 

67.5 

3150 

0.129 

3097.99 

66.0 

35.50 

0.120 

3893.67 

75.5 

39.25 

0.178 

2238.66 

68.0 

31.75 

0.133 

2314.79 

64.0 

33.10 

0106 

2667.07 

66.0 

34.70 

0.120 

2582.07 

68.0 

33.50 

0.133 

3426.50 

75.0 

37.00 

0175 

2307.34 

60.0 

33.40 

0 078 

3960.41 

76.0 

39.30 

0.18! 


$X_3$†

$X_1$†

$X_1'$

$z$

0.150 

0455 

0475 

-0 020 

0.1 80 

0 518 

0 505 

0 013 

0 0^ 

0 223 

0 250 

-0 052 

0119 

0.34R 

0 349 

0 008 

0 131 

0.373 

0 355 

OOlf 

0 185 

0 490 

0.510 

-0 020 

0.202 

0 532 

0 5£fl 

0 022 

0167 

0 495 

0 435 

OO^-n 

0 185 

0 559 

0 530 

0 023 

0 162 

0 450 

0 490 

-0010 

0 272 

0 781 

:o£os 

-0 024 

0.168 

0 491 

0 4.}5 

0046 

0.180 

0.529 

0 510 

0 014 

0.144 

0 4S3 

0 465 

0018 

0 II2 

0 353 

0 345 

0(V!S 

0 168 

0 551 

0 520 

0 031 

0 138 

0 42! 

0 405 

0 016 

0150 

0.505 

0 455 

0 050 

0.144 

0413 

0.435 

-0 022 

0.183 

0.573 

0 560 

0 013 

0.197 

0 586 

0 575 

0 011 

0.199 

0 583 

0 585 

-0002 

0 119 

0 408 

0 365 

0043 

0 156 

0 494 

0 4''-0 

0 034 

0.105 

0 327 

0 335 

-0 008 

0 156 

0 466 

0 460 

0 006 

0.1 3S 

0 46': 

0 435 

0 033 

0 119 

0 385 

0 405 

-0 020 

0 098 

0316 

0 320 

-0 004 

0 077 

0 279 

0 2-5 

-0006 

0.207 

0 632 

0 630 

0 092 

0.112 

0.381 

0 360 

0 021 

0.150 

0 491 

0 430 

0 061 

0.194 

0.590 

0 585 

0 005 

0.102 

0 350 

0.340 

0010 

0.120 

0 364 

0.355 

0 0G9 

0.140 

0426 

0410 

0 016 

0.125 

0 412 

0 395 

0017 

0.1 68 

0 535 

0 520 

0015 

0 124 

0 363 

0 335 

0 02= 

0.194 

0 593 

0 5«5 

0013 


See footnotes on first page of table. 



between 1945 and 1956.) However, the data still provide an excellent illustration of the general problem of joint functional relations.

At the time of the study, farmers in this area ordinarily sold their hay unbaled and in the stack. It was therefore necessary to estimate the quantity of hay in each stack. Two measurements, which could be made readily with only a rope, were usually employed: the perimeter around the base of the stack and the "over," or the distance from the ground on one side of the stack over the center to the ground on the other.

The observations shown in Table 21.1 are all for round stacks. These stacks vary in height and shape to some extent, however, so their volume cannot be computed from the basal circumference by any simple mathematical rule. The volumes shown in the table were computed from careful surveying measurements of all the dimensions of each stack, much more exact measurements than a farmer would be able to make in practice. The problem is to establish the average volume for specified circumferences and "overs," so that farmers might use these two measurements, and also to determine how reliable these estimates are.

The volume will tend to be some function of the basal area times the height. The basal area is a function of the square of the basal circumference; the "over" is a function of both the basal diameter and the height, but attempts to separate the two have been unsuccessful. It is obvious that, because of the multiplying nature of the relations,

$$\text{volume} = f(\text{circumference}) \cdot f(\text{over})$$

Such a relationship may be approached by use of the relation

$$\log \text{volume} = f(\log \text{circumference}) + f(\log \text{over})$$

Attempts to determine the relationship by this equation, however, have not been fully successful. The shape of the stacks apparently shifts with changes in size. The problem is evidently one where the relation may best be expressed by a joint function such as

$$\text{volume} = f(\text{circumference}, \text{over})$$

Such a relation could be determined directly from the data by the methods which will presently be described. It is evident that the correlation surface would have a marked upward slope as the two dimensions increased together, even if the usual volume formulas applied. The work may be somewhat simplified by stating each variable as a logarithm and then determining the joint relation according to the equation

$$\log \text{volume} = f(\log \text{circumference}, \log \text{over})$$




The logarithms (to base 10) are entered in Table 21.1, designated as $X_2$, $X_3$, and $X_1$. (1.7 has been subtracted from the logarithm for circumference, 1.4 from the logarithm for "over," and 3.0 from the logarithm for volume.)
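The coding of the three variables can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the original computation: the function name `code_stack` is hypothetical, and the sample row is the first stack listed in Table 21.1.

```python
import math

def code_stack(volume, circumference, over):
    """Return (X1, X2, X3): base-10 logarithms less 3.0, 1.7, and 1.4,
    each stated to three decimal places as in Table 21.1."""
    x1 = round(math.log10(volume) - 3.0, 3)
    x2 = round(math.log10(circumference) - 1.7, 3)
    x3 = round(math.log10(over) - 1.4, 3)
    return x1, x2, x3

# First stack in Table 21.1: 2853.00 cubic feet, 69.0 feet around, 37.00 feet "over".
x1, x2, x3 = code_stack(2853.00, 69.0, 37.00)
print(x1, x2, x3)
```

Subtracting the constants only shifts the origin of each logarithm, so it changes the constant term of any fitted relation but not the shape of the surface.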

The joint function may be determined either by fitting some appropriate algebraic equation, or by graphic processes. Only in rare cases will there be a good logical basis for judging the form of the joint function to be expected. In most cases, therefore, even the algebraic equation must be selected with some reference to its ability to represent the type of joint function shown in the data, as empirically determined by some form of graphic examination. The methods of determining the joint function will therefore first be illustrated for the graphic method, and appropriate mathematical equations to represent various forms of joint surfaces will be considered later.

Subgrouping and Averaging the Observations. The first step is to classify the observations according to $X_2$, and subclassify according to $X_3$, and determine the averages of $X_1$, $X_2$, and $X_3$ for each group. It is not worth while to make too many groups. Four groups each way would give 16 subgroups, and 5 each way would give 25. If the cases were uniformly distributed through 25 subgroups, that would make less than 5 cases to a group, which is rather thin for a satisfactory average. The cases will not necessarily be distributed uniformly, so it may be best to try the fivefold classification. The results are shown in Table 21.2.

Table 21.2

Number of Haystack Observations, Classified According to $X_2$ and $X_3$
(Logarithms of Circumference and "Over")

                                  $X_2$ values
$X_3$ values       Under     0.090-    0.120-    0.150-    0.180
                   0.090     0.119     0.149     0.179     and over

Under 0.100          2         7         3         1        ...
0.100-0.139          1        14        10         3         2
0.140-0.179          1         8        17         8         2
0.180-0.219         ...        2        10        14         7
0.220 and over      ...       ...        1         2         5



There is a marked correlation between $X_2$ and $X_3$, so a few groups have 10 or more reports, whereas 15 out of the 25 have under 5. Preliminary examination of the data indicates that a unit change in $X_3$ is generally accompanied by a larger change in $X_1$ than is a unit change in $X_2$. Accordingly we may decide to halve the groups in the central portion of the range of $X_3$, making the class intervals with respect to that variable under 0.100, 0.100-0.119, 0.120-0.139, 0.140-0.159, 0.160-0.179, 0.180-0.199, 0.200-0.219, and 0.220 and over. With 5 classes for $X_2$, this will give a 40-group classification, but with many of the "cells" vacant. Averaging $X_1$, $X_2$, and $X_3$ for each of the resulting groups gives means as shown in Table 21.3.
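The subgrouping and averaging step lends itself to a short sketch. The following is a minimal illustration under stated assumptions, not the procedure as originally carried out by hand: the function name `group_means` is hypothetical, the class limits are those described above, and the five sample rows are stacks taken from the first page of Table 21.1.

```python
from bisect import bisect_right
from collections import defaultdict

# Lower limits of every class after the first ("under ...") class.
X2_EDGES = [0.090, 0.120, 0.150, 0.180]
X3_EDGES = [0.100, 0.120, 0.140, 0.160, 0.180, 0.200, 0.220]

def group_means(observations):
    """Classify (x1, x2, x3) rows by X2 class and X3 class, then return
    {(x2_class, x3_class): (cases, mean_x1, mean_x2, mean_x3)}."""
    cells = defaultdict(list)
    for x1, x2, x3 in observations:
        key = (bisect_right(X2_EDGES, x2), bisect_right(X3_EDGES, x3))
        cells[key].append((x1, x2, x3))
    means = {}
    for key, rows in cells.items():
        n = len(rows)
        means[key] = (n,) + tuple(sum(col) / n for col in zip(*rows))
    return means

# (X1, X2, X3) for five stacks from Table 21.1.
sample = [
    (0.455, 0.139, 0.168),
    (0.432, 0.113, 0.162),
    (0.435, 0.133, 0.162),
    (0.498, 0.126, 0.175),
    (0.446, 0.106, 0.165),
]
cells = group_means(sample)
for key in sorted(cells):
    print(key, cells[key])
```

Class indices start at 0 for the "under" class, so the key `(2, 4)` stands for $X_2$ of 0.120-0.149 and $X_3$ of 0.160-0.179. Vacant cells simply never appear in the result.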


Table 21.3

Haystack Data: Average $X_1$, $X_2$, and $X_3$ for Observations Classified
by $X_2$ and $X_3$

$X_3$ values        Number of cases    Mean $X_2$    Mean $X_3$    Mean $X_1$

$X_2$ under 0.090
Under 0.100                2             0.071         0.055         0.185
0.100-0.119               ...
0.120-0.139                1             0.078         0.124         0.363
0.140-0.159               ...
0.160-0.179                1             0.086         0.162         0.394

$X_2$ 0.090-0.119
Under 0.100                7             0.102         0.081         0.278
0.100-0.119               10             0.107         0.112         0.332
0.120-0.139                4             0.106         0.127         0.379
0.140-0.159                2             0.099         0.144         0.369
0.160-0.179                6             0.105         0.167         0.473
0.180-0.199                2             0.103         0.183         0.459

$X_2$ 0.120-0.149
Under 0.100                3             0.130         0.085         0.278
0.100-0.119                6             0.132         0.108         0.362
0.120-0.139                4             0.130         0.131         0.417
0.140-0.159               12             0.129         0.149         0.440
0.160-0.179                5             0.135         0.171         0.483
0.180-0.199                8             0.135         0.181         0.515
0.200-0.219                2             0.145         0.208         0.568
0.220 and over             1             0.140         0.236         0.655

$X_2$ 0.150-0.179
Under 0.100                1             0.157         0.077         0.279
0.100-0.119               ...
0.120-0.139                3             0.161         0.138         0.446
0.140-0.159                4             0.159         0.150         0.456
0.160-0.179                4             0.163         0.167         0.492
0.180-0.199                9             0.161         0.193         0.558
0.200-0.219                5             0.167         0.208         0.606
0.220 and over             2             0.157         0.252         0.719

$X_2$ 0.180 and over
0.100-0.119                1             0.181         0.119         0.385
0.120-0.139                1             0.181         0.139         0.443
0.140-0.159                1             0.186         0.144         0.483
0.160-0.179                1             0.181         0.168         0.551
0.180-0.199                3             0.186         0.187         0.574
0.200-0.219                4             0.198         0.210         0.638
0.220 and over             5             0.199         0.242         0.724


Fitting a Joint Function Graphically. The most rapid method of getting an approximate shape of the surface is by drawing contour lines, just as surveyors draw contours in preparing a topographic map. A chart is prepared as shown in Figure 21.4, with values of one independent variable as the abscissa and values of the other as the ordinate. (In making this chart, a sheet of cross-section paper ruled 10 lines to the inch, and 16 by 24 inches large, was used, in order to enter and read the values with satisfactory accuracy.) A dot is then entered corresponding to the average values of $X_2$ and $X_3$ for each group in Table 21.3. The average value of $X_1$ is then written in by the dot for each group, and enclosed in parentheses when the group has less than 3 cases. Dotted lines are drawn in, roughly separating the dots into those having $X_1$ values below 300, below 350, etc. (in units of the third decimal place), as nearly as possible, and leaving intermediate values at corresponding distances between the lines where possible. (Values for the lines are indicated at the end of each line.) Solid lines are then drawn in, corresponding as closely as possible to the dotted lines but spaced as regularly as possible across the chart, and with similar shapes, so as to give a smooth continuous surface without "bumps," while conforming to the general shape of the topographic surface indicated by the dotted lines. (This process of smoothing can be checked by reading off values of $X_1$ corresponding to successive given values of $X_2$ with $X_3$ constant, or of $X_3$ with $X_2$ constant, smoothing these values graphically, and then drawing the corresponding smoothed contours in on the chart.) Even with only freehand smoothing by eye, however, as used in drawing the solid lines shown in the figure, reasonably good results can be obtained.

Fig. 21.4. Mean values of $X_1$ for group mean values of $X_2$ and $X_3$, and original and smoothed contours fitted to group averages.



Fig. 21.5. Estimated values of $X_1$ read from Fig. 21.4, related to actual mean values of $X_1$.

Values of $X_1$ read from the freehand solid lines entered in Figure 21.4 correlate with the original values of $X_1$ for the group averages as shown in Figure 21.5. Apparently there is a close, one-to-one correspondence between the original values and the smoothed surface, with a very high correlation.

An alternative method of determining the surface graphically is by plotting the $X_1$ values against the values of one independent variable for each set of groups classified with respect to the other, fitting lines or curves freehand to each of these sets of data (representing successive intercepts or slices across the function), smoothing estimated values read from these fitted lines for given values of the first variable across the other axis, and continuing this process until no further changes in the smoothed function seem needed. This process, illustrated in detail in previous editions of this book (second edition, pages 382 to 387), has little advantage over the contour method and is far slower, and therefore has been omitted here.

The shape of the joint function described by the solid contour lines on Figure 21.4 may be seen by reading off values for given combinations of $X_2$ and $X_3$, and showing these values in tabular or graphic form. Table 21.4 gives them, read from the chart by interpolation between the contours, scaling off the distances perpendicular to the nearest contour.


Table 21.4

Values of $X_1$ for Specified Values of $X_2$ and $X_3$,
Estimated from Contour Lines

                              $X_3$
$X_2$       0.080     0.100     0.150     0.200     0.250

0.100       0.270     0.312     0.405     0.487      ...
0.120       0.280     0.328     0.430     0.518      ...
0.140       0.286     0.340     0.448     0.548     0.677
0.160       0.290     0.349     0.463     0.578     0.713
0.180        ...       ...      0.477     0.600     0.732
0.200        ...       ...       ...      0.617     0.747


When these values are plotted on a three-dimensional diagram, with $X_1$ as the ordinate, the resulting surface is found to be as shown in Figure 21.6. The joint function is seen to be almost linear in any one dimension but warped upwards, so that the slope corresponding to $X_2$ is much steeper at high values of $X_3$ than it is at low values of $X_3$. This warped surface could not be fully represented by a plane of any type.

In determining the surface shown in Figures 21.4 and 21.6, the data were subclassified into a 5 × 8 grouping. Out of the 40 possible subgroups, 1 or more observations occurred in 31. Discarding the one observation with $X_3 = 0.055$ and $X_2 = 0.071$, both far below the range of the other observations (though with a value of $X_1$ not inconsistent with them), there were 30 group averages on which to base the graphic analysis to see if there was any consistent indication of the presence of a joint function. As shown in the two figures, this gave a definite indication of the presence of a joint function which could not be represented by adding two net functions for $X_2$ and $X_3$.

In this case, with 119 usable observations, and with high multiple correlation so that the individual values of $X_1$ fell close to the joint surface (as will be seen shortly), it was possible to use a large number of subgroups in getting many averages to which to fit the surface. Where the number of observations is smaller, or the multiple correlation lower, a smaller number of groups would have to be used. Even a 2 × 2 classification with 4 groups, or a 3 × 3 with 9, would serve to give some indication of whether or not a joint function was present. In such cases, if the group averages indicate a warped surface, not only should group averages be plotted on the topographic chart on which the contours are to be drawn (Figure 21.4), but the original observations should also be plotted, separately designated, and the contour lines fitted with some reference to the individual observations as well as the group averages. This is especially necessary if the surface of the function appears to be sharply curved rather than linear, as the average of points along a convex or concave curve will of course not lie on the curve.

Fig. 21.6. Estimated values of $X_1$ for specific combinations of $X_2$ and $X_3$, from smoothed contours.

Where the number of observations is quite small, but the correlation is high, the contours may be fitted directly to the individual observations without making averages at all. An illustration of such a problem is shown on page 373.

Fitting a Joint Function Algebraically. In the same way that definite equations can be determined by least squares to represent curvilinear net regressions, certain types of joint functional surfaces can be represented by algebraic equations of types which can be fitted by least squares. After the kind of surface present has been determined by fitting contours graphically and plotting the surface so found, it is then much easier to choose an appropriate equation. In some cases, the equation can be deduced logically and used as a model of the relation expected.

In the haystack-volume problem, Figure 21.6 shows that the joint function is of a type where the regression of $X_1$ on $X_2$ is substantially linear for any given value of $X_3$, but the slope of the regression changes as the values of $X_3$ change. If it is assumed that the slope changes at a constant rate with changes in $X_3$, this condition may be expressed in the relation

$$X_1 = a + b(c + dX_3)X_2$$

Multiplied out, this becomes

$$X_1 = a + bcX_2 + bdX_2X_3$$

which, writing $e = bc$ and $g = bd$, may be stated

$$X_1 = a + eX_2 + g(X_2X_3) \tag{21.2}$$

The values of $a$, $e$, and $g$ may then be determined by the usual methods of linear multiple correlation, with $X_2$ and the values of the product $(X_2X_3)$ used as the independent factors.

If it is assumed that $X_1$ varies with $X_3$ other than through its influence on the slope of the regression on $X_2$, an additional term may be added to the equation, making it

$$X_1 = a + eX_2 + g(X_2X_3) + hX_3 \tag{21.3}$$

Determining the values of the four constants of equation (21.3) from the haystack data, and working out estimated values of $X_1$ for specific combinations of values of $X_2$ and $X_3$, we would arrive at substantially the same joint functional surface as was determined by the contour method.
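Fitting equation (21.3) is an ordinary linear least-squares problem once the product $X_2X_3$ is treated as a variable in its own right. The sketch below illustrates this under stated assumptions: the data are synthetic (generated from made-up constants, not the haystack observations), the function names are hypothetical, and the normal equations are solved by plain Gauss-Jordan elimination rather than by the hand methods of the text.

```python
def design_row(x2, x3):
    """Independent factors of equation (21.3): constant, X2, the product X2*X3, X3."""
    return [1.0, x2, x2 * x3, x3]

def fit_least_squares(rows, y):
    """Solve the normal equations (A'A) beta = A'y by Gauss-Jordan elimination."""
    k = len(rows[0])
    # Augmented matrix [A'A | A'y].
    m = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    for col in range(k):
        # Partial pivoting: bring up the row with the largest entry in this column.
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(k):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    return [m[i][k] / m[i][i] for i in range(k)]

# Synthetic example: made-up constants a, e, g, h on a small grid of X2 and X3.
TRUE_COEF = [0.05, 0.40, 6.00, 1.50]
grid = [(x2, x3) for x2 in (0.08, 0.12, 0.16, 0.20) for x3 in (0.08, 0.14, 0.20, 0.26)]
rows = [design_row(x2, x3) for x2, x3 in grid]
y = [sum(c * v for c, v in zip(TRUE_COEF, row)) for row in rows]

coef = fit_least_squares(rows, y)
print([round(c, 4) for c in coef])
```

Because the synthetic values lie exactly on the surface, the fit recovers the constants used to generate them. With real observations the same routine gives the least-squares values of $a$, $e$, $g$, and $h$; dropping the last entry of `design_row` gives equation (21.2).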

Other algebraic equations which have shown good ability to represent joint functions are as follows:

A, = + b^X^, + *3^3 + btiX^ + A3) 

(This equation is useful with a plane rotated so as to show higher values diagonally across the joint surface.)

For production functions, where two fertilizer constituents are each applied in varying quantities, these equations have proved useful:

$$X_1 = a + b_2X_2 + b_3X_3 + b_4X_2^2 + b_5X_3^2 + b_6X_2X_3$$

$$X_1 = a + b_2X_2 + b_3X_3 + b_4\sqrt{X_2} + b_5\sqrt{X_3} + b_6\sqrt{X_2X_3}$$

In fitting these equations, the values are first determined for the complete set of terms as shown, and then, when certain of the coefficients show values insignificant as compared to their own standard errors, the solutions are recomputed omitting these terms.

The last two and other equations were extensively tested in an experimental study of the results of applying varying quantities of phosphate and nitrogen to corn, red clover, and alfalfa.* The equations used were deduced from the hypothesis that in applying fertilizer in successive units of the same size, the law of diminishing returns would hold, so that (1) each additional unit of each fertilizer would add less to the yield than the preceding unit, up to some maximum point; and (2) the combined result of two fertilizer ingredients together would be greater than the sum of the effect of each one separately. Both equations, one based on parabolas and one on square roots, satisfy these two logical conditions, and provided two alternative equations to represent the model. (A third form, not shown here, did not quite meet these two conditions.)

The authors found that the square-root form of equation gave the best fit to the experimental data for all three crops. For red clover and alfalfa, certain terms gave results not statistically significant, and the final equations therefore eliminated these terms.

Stating total yield ($Y$) in bushels or tons per acre, phosphate ($P$) in pounds of P2O5 per acre, and potassium ($K$) and nitrogen ($N$) in pounds per acre, the three selected equations were:

Corn:

$$Y = -5.682 - 0.316N - 0.417P + 6.3512\sqrt{N} + 8.5155\sqrt{P} + 0.3410\sqrt{NP}$$

Red clover 7= 2.468 - 0.003947F -b 0.0283^% K -b 0.127892 V F 
- 0.000979 ^/7fP 

Alfalfa:

$$Y = 1.837 - 0.0014K - 0.0050P + 0.061731\sqrt{K} + 0.173513\sqrt{P} - 0.001440\sqrt{KP}$$

The yield-response surfaces described by the equations for corn and for alfalfa are shown in Figures 21.7 and 21.8.

With the relations determined in an algebraic equation, it is relatively simple to estimate new values of the dependent variable for various values of the independent variables, and to determine confidence intervals for each constant. Various other mathematical operations can also be performed. For example, the authors calculated the "replacement rates" of one nutrient by another, at various points on the joint surface, and, by applying assumed costs and prices of the several variables, calculated probable least-cost combinations and economic optima in the use of fertilizers.†

* Earl O. Heady, John T. Pesek, and William G. Brown, Crop response surfaces and economic optima in fertilizer use, Iowa State College Agricultural Experiment Station Research Bulletin 424, March, 1955.
† Earl O. Heady et al., op. cit., pp. 321-25.

Fig. 21.7. Response of corn yields to varying quantities of nitrogen and phosphate, as shown by fitted joint function (from bulletin by Heady et al., op. cit.).

Determining the Standard Error of Estimate and Index of Multiple Correlation. With the joint function fitted by an algebraic equation, the standard error of estimate and the index of multiple correlation with respect to the joint function can be calculated by the usual equations, (15.6) and (15.7). Similarly, confidence intervals of any given probability around the fitted joint surface can be calculated by an extension of the methods of Chapter 17, and the reliability of individual forecasts by the methods of Chapter 19. These confidence calculations are, of course, applicable only where the observations represent a sample randomly drawn from a defined universe, the same one for which estimates are to be made. Where, as in the case of results based on experimental data such as those shown in Figures 21.7 and 21.8, the data are selected (experimentally or otherwise) with respect to the values of the independent variables, the correlation index has sampling significance only in the very special "universe" in which those same values of the independent variables remain fixed for all possible samples. The standard error of estimate may still give a reasonably accurate forecast of errors to be expected in new observations drawn from the more general universe, in which the frequencies of various values of the independent variables may differ from those in the original sample.‡ (See Chapter 17.) If the closeness of fit for joint surfaces fitted to the same data by several different equations is to be determined, either the multiple correlation index or the standard error of estimate will serve to indicate which ones will give relatively satisfactory results. For such cases it is important to include the adjustments for degrees of freedom [equations (12.2), (15.6), and (17.9)] if the several equations tested have varying numbers of terms and if $n$ is not very large, say, 100 cases or less.

Fig. 21.8. Response of alfalfa yields to varying quantities of potassium and phosphate, as shown by fitted joint function (from bulletin by Heady et al., op. cit.).

‡ Least-squares regression analysis typically assumes that the variance of $z$ is uniform throughout the universe represented by the regression function. On this assumption, $\sigma_z$ is independent of the particular values chosen for the independent variables.

It should be noted that the equation which gives the highest adjusted correlation coefficient or index may still not be significantly better than one or more alternative equations. Some tests of significance in this sense, based on analysis of variance, are presented in Chapter 23. The adjustments of $R$ and $I$ merely correct for bias in estimating the corresponding universe values. If the addition of a new variable using one degree of freedom increases $R^2$ by more than the amount $(1 - R^2)/(n - m)$, $\bar{R}^2$ will rise. But the increase could still be due to the chance correlation of the new variable with the residuals from the regression involving other variables in this particular sample. In a sample of 20 observations a fourth independent variable would have to increase $R^2$ over $R^2_{1.234}$ by approximately $3(1 - R^2_{1.234})/(n - m)$ to be considered a significant improvement at the 5 per cent level of probability.
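The threshold for a rise in the adjusted coefficient can be checked numerically. The sketch below assumes the usual adjustment $\bar{R}^2 = 1 - (1 - R^2)(n - 1)/(n - m)$, with $m$ counting all fitted constants including the constant term; the function names and the sample figures (20 observations, 4 constants, $R^2 = 0.80$) are illustrative assumptions.

```python
def adjusted_r2(r2, n, m):
    """Adjusted R-squared for n observations and m fitted constants (constant term included)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - m)

def gain_needed(r2, n, m):
    """Rise in R-squared at which the adjusted value is left unchanged
    when one more constant is added to an equation that has m constants."""
    return (1.0 - r2) / (n - m)

n, m, r2 = 20, 4, 0.80
threshold = gain_needed(r2, n, m)          # (1 - 0.80) / 16
before = adjusted_r2(r2, n, m)
at_threshold = adjusted_r2(r2 + threshold, n, m + 1)
above = adjusted_r2(0.82, n, m + 1)        # gain of 0.020, larger than the threshold
below = adjusted_r2(0.805, n, m + 1)       # gain of 0.005, smaller than the threshold
print(threshold, before, at_threshold, above, below)
```

A gain exactly equal to the threshold leaves the adjusted value where it was; only a larger gain raises it. As the text stresses, a rise of this kind is a much weaker requirement than statistical significance.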

This more demanding (and scientific) test of significance serves as a further curb on the tendency to particularize regression functions too closely to what may be random features of one sample. While the danger is sometimes alleged to be greatest when freehand curves are used, the same temptations arise when two or more mathematical equations are fitted to the same set of observations. But if an apparent gain has been achieved by testing a number of alternative additional terms in the equation, or a number of alternative additional variables in the multiple correlation, and one of those tested has been found to show an apparent gain with odds, say, of 19 out of 20 by the tests applied to that one variable alone, then the true odds of its being significant are far less. For example, if ten alternative additional terms have each been tested in turn, and one of them has been found to show a gain with this much significance, the true odds for its being significant are not 19 out of 20, but only 10 out of 20, since any one of the ten tested might, by chance, have shown the same level of significance.
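The arithmetic behind this warning can be sketched as follows. The figure of 10 out of 20 corresponds to simply adding up the ten separate chances of 1 in 20; if the ten tests are further assumed to be independent (an assumption made here for illustration only, and not stated in the text), the chance that at least one shows a spurious gain can be computed exactly. The function name is hypothetical.

```python
def chance_of_spurious_gain(tests, level=0.05):
    """Chance that at least one of several tests reaches the given level by chance alone.

    Returns (additive_bound, independent_value):
      additive_bound    -- the separate chances simply added together (capped at 1)
      independent_value -- exact value if the tests are independent of one another
    """
    additive_bound = min(1.0, tests * level)
    independent_value = 1.0 - (1.0 - level) ** tests
    return additive_bound, independent_value

bound, independent = chance_of_spurious_gain(10)
print(bound, round(independent, 3))
```

Under either reckoning the chance of a purely accidental "significant" term is many times the nominal 1 in 20, which is the point of the passage.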

When the joint function has been determined graphically, as in the haystack problem, the same general method is followed as with graphic multiple regression curves: working out estimated values, $X_1'$, from the graphic function, calculating the $z$'s and their standard error, and then calculating the two statistics from this. Performing this process for the haystack problem, estimated values for each observation in the sample are read off from the contour lines in Figure 21.4, interpolating linearly for distances between the adjacent contours. The resulting values of $X_1'$ are shown to the nearest 5 units in the third decimal place in the next-to-the-last column of Table 21.1. (More exact values could be read by scaling the distances in millimeters from the original large chart, and this was done in reading the values for Table 21.4, but it was felt this would give an impression of spuriously high accuracy for the individual estimates, though it might have reduced the size of $s_z$ slightly.) The value of $z = X_1 - X_1'$ is calculated, as shown in the final column of the table, and so is the value of $s_z$. The standard deviation of the original values of $X_1$ is 0.1265, whereas $s_z$ is 0.0281. In this case the regression surface accounts for a large part of the variation in volume. The standard error in estimating $X_1$ from the joint function may therefore be stated

$$S_{1.f(2,3)} = 0.0281$$


Similarly, the index of multiple correlation for $X_1$ as a joint function of $X_2$ and $X_3$ may be computed by the usual equation*

$$I^2_{1.f(2,3)} = 1 - \frac{s_z^2}{s_1^2} = 1 - \frac{(0.0281)^2}{(0.1265)^2} = 1 - 0.047342 = 0.952658$$

$$I_{1.f(2,3)} = 0.976$$

The joint function therefore explains 95 per cent of the observed variance in volume.
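The computation of the index can be repeated directly from the two standard deviations. The sketch below is illustrative only; the function name is hypothetical. Using the rounded figures 0.0281 and 0.1265 as printed, the unexplained proportion comes to about 0.0493 and the index to about 0.975, slightly different from the printed 0.047342 and 0.976; the difference presumably reflects unrounded values carried in the original working, and either way about 95 per cent of the variance is accounted for.

```python
import math

def index_of_multiple_correlation(s_z, s_1):
    """Return (unexplained proportion of variance, index I) from the standard
    deviation of the residuals, s_z, and of the dependent variable, s_1."""
    unexplained = (s_z / s_1) ** 2
    return unexplained, math.sqrt(1.0 - unexplained)

unexplained, index = index_of_multiple_correlation(0.0281, 0.1265)
print(round(unexplained, 4), round(index, 3))
```

No adjustment for degrees of freedom is applied here, matching the treatment in the text for this large sample.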

The sampling significance of the values of $I_{1.f(2,3)}$, whether determined by graphic or algebraic fitting, is discussed in Chapter 17.

It is evident that the volume of a round haystack may be very closelv 
estimated from the rough farm measurements of circumference and “ov er." 
The standard error of estimate, 0.0281, indicates that the logarithm of 
volume can be estimated to ±0.0281 of the true logarithm for two-thirds 
of the observations, and to ±0.0562 for 95 per cent of them. Taking the 
antilogs of these values, 0.0281 to T.9719 and 0.0562 to 1.9438, we find 
that the confidence intervals are 93.7 and 106.6 per cent for P — 0.67, 
and 87.9 and 113.7 per cent for P = 0.95. 
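The conversion from an error in logarithms to percentage limits can be sketched as follows. The helper name `percent_limits` is hypothetical, and the calculation assumes common (base-10) logarithms, as the antilog notation in the text implies. Direct computation may differ from the printed figures by about a tenth of a per cent, which is the kind of gap a four-place log table would produce.

```python
def percent_limits(log_error: float) -> tuple[float, float]:
    """Turn a +/- error in base-10 logarithms into lower and upper
    limits expressed as a percentage of the estimated value."""
    lower = 100.0 * 10.0 ** (-log_error)
    upper = 100.0 * 10.0 ** (log_error)
    return lower, upper


S = 0.0281  # standard error of estimate, in logarithms (from the text)

low67, high67 = percent_limits(S)        # about two-thirds of the observations
low95, high95 = percent_limits(2 * S)    # about 95 per cent of the observations

print(f"P = 0.67: {low67:.1f} to {high67:.1f} per cent")
print(f"P = 0.95: {low95:.1f} to {high95:.1f} per cent")
```

Because the error is symmetric in logarithms, the percentage limits are not symmetric about 100.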

Stating the Conclusions Shown by the Joint Function. After the
joint relations have been determined by either method, the regression
surface may be stated in simpler terms by preparing tables showing the
expected values of $X_1$ for stated combinations of $X_2$ and $X_3$. In this
problem, that involves determining the logarithms of $X_2$ and $X_3$ for the
selected values, reading off from the charts the corresponding estimated
value for the logarithm of $X_1$, and finding its antilogarithm. Carrying
out this process, we obtain the values shown in Table 21.5.

A similar table for the corn yields corresponding to Figure 21.7 is
shown in Table 21.6.

* In view of the large number of observations, the adjustments for n and m are
ignored here.





Table 21.5

Average Volume of Round Haystacks for Different Combinations of
Circumference and "Over"

                              "Over" (feet)
Circumference       30          34          38          42
                              (cubic feet)
60 feet            1,760       2,240
65 feet            1,860       2,430       2,990
70 feet            1,910       2,660       3,270       4,070
75 feet                        2,690       3,510       4,470
80 feet                                                4,730

Table 21.6

Predicted Total Yields for Specified Combinations of Fertilizer
Applied to Corn*

P2O5                          Nitrogen (pounds per acre)
(pounds per acre)       0         80        160        240        320
                                 (bushels per acre)
    0                 -5.7†      25.8       24.0       16.8        6.8
   80                 37.1       95.9      105.4      106.9      104.1
  160                 35.3      105.4      119.6      124.6      124.9
  240                 26.1      104.8      122.6      130.4      133.0
  320                 13.1       99.2      120.0      130.1      134.7

* From Heady et al., op. cit., p. 305.
† As published; charted as zero.

Determining a Joint Function for Three or More 
Independent Variables 

There are several different ways by which joint functions involving
three or more independent variables may be fitted. In some cases it
may be desirable to allow for the joint influence of two variables while
simultaneously eliminating, or holding constant, the net effect of one or
more additional independent variables. Thus in the corn-yield problem
of Chapter 14 it might be desirable to determine the joint relation of yield
to rainfall and temperature, while simultaneously allowing for the upward
tendency in yields during the period studied. This might be done by
determining the relation according to the general equation

$$X_1 = f(X_2, X_3) + f_4(X_4) \qquad (21.4)$$


This relation may be worked out either algebraically or graphically.
Algebraically, it would simply mean adding one or more appropriate terms
in $X_4$ [$b_4X_4$; or $b_4X_4$ plus a second term in some function of $X_4$;
etc.] to the equations such as those previously considered. In view of the
large number of terms in the regression equation, it would be especially
important to eliminate those terms whose coefficients did not show
significant values, one by one, and to solve the equations and recalculate
the values of the remaining constants with those terms omitted.

If worked out graphically, it may be done by combining the graphic
contour method for two independent variables with the method of
successive approximations for multiple curvilinear regressions as discussed
in Chapters 14 and 16. The steps would be (1) to determine the usual net
regression curves for each of the independent variables according to the
simpler equation,

$$X_1 = a + f_2(X_2) + f_3(X_3) + f_4(X_4)$$

and then (2) to subclassify the residuals according to the values of $X_2$
and $X_3$ and enter them on a chart like Figure 21.4, for the group averages,
or for individual observations if the number was not large enough to
give satisfactory group averages. If the chart showed indications of the
presence of a joint relation, contours would then be fitted to them. The
previous residuals would then be adjusted to take account of this joint
relation. If the variance had been significantly reduced, the residuals
would then be averaged with respect to the remaining independent factor
$X_4$ to see if the net curve for that factor would need to be changed now
that the joint relation to the two other factors had been allowed for.
This process of successive approximation would be continued until the
final shape of the curve for $X_4$ and the joint surface for $X_2$, $X_3$ had
been best determined. The final function $f(X_2, X_3)$ would be calculated by
adding the values for the earlier curves $f_2(X_2)$ and $f_3(X_3)$ to the values
from the new joint function for the residuals, for selected values of $X_2$
and $X_3$.

Contour Fitting for Three Independent Variables with a Small
Sample. An example of three independent variables fitted with a joint
function to two of them, and with only 16 observations, will illustrate
this process in part. This is based on data relating to the yield of potatoes
and rainfall in Maine, in July and August. The preliminary multiple
regression analysis had given a value for trend in yields ($X_4$) as shown in
the fifth column of Table 21.7.


Table 21.7

Weather Conditions and Yield of Potatoes in Maine

        Rainfall to   Rainfall                Adjust-    Yield      Estimated
        August 1      August 1 to             ment for   Adjusted   Yield
        (July         September 15   Yield    Trend*     for Trend  f(X2, X3)  Residual
Year    doubled) X2   X3             X1
        Inches        Inches         Bushels  Bushels    Bushels    Bushels    Bushels
1913     13.17          3.66           220      +26        246        248         -2
1914     11.33          4.08           260      +27        287        260         27
1915     15.96          4.12           179      +31        210        229        -19
1916     15.46          3.77           204      +33        237        236          1
1917     17.77          5.53           125      +31        156        155          1
1918     18.09          3.87           200      +22        222        220          2
1919     12.25          5.41           230      +17        247        248         -1
1920     13.29          7.61           177      +15        192        196         -4
1921      7.82          6.11           298      +13        311        323        -12
1922     16.40          5.12           187      +12        199        197          2
1923     10.61          3.51           258       +9        267        278        -11
1924      9.10          6.13           315       +7        322        308         14
1925     11.30          5.38           250       +5        255        262         -7
1926      9.60          5.60           290       +3        293        297         -4
1927     13.98          6.02           232       +1        233        226          7
1928     15.45          6.45           220       -1        219

* Simultaneously determined while allowing for trend. See F. V. Waugh, Methods of
forecasting New England potato yields, U. S. Department of Agriculture, Bureau of
Agricultural Economics, mimeographed report, February, 1929.


The data are plotted in Figure 21.9, with the yield adjusted for trend
used as the dependent factor. Drawing in contours so as to separate
years of similar yields, we find that a very peculiar type of surface is
indicated: one that changes elevation very rapidly between the combination
of high early rainfall and low late rainfall, and high early rainfall and
high late rainfall. When these results are used to forecast the yield in 1928
(which year, it will be noted, was not plotted or used in determining the
contours) a yield of about 180 bushels is indicated. This is only in fair
agreement with the final yield of 219 bushels, determined several months
after the climatic data were available to give the forecast stated.

Reading off the estimated values for each year shown, the estimated
adjusted yields as shown in the next-to-the-last column of Table 21.7 are
obtained. The standard deviation of the residuals, shown in the next
column, is 10.6 bushels, whereas the standard deviation of the yield
adjusted for trend is 63.0. If five constants are assumed to be necessary to
represent the surface mathematically, the standard error of estimate would
be 13.0 bushels and the index of correlation for the surface indicated by the
contours would be 0.98. If it is assumed that the trend line was fairly
accurately projected, the standard error of estimate indicates that an error
as great as that in 1928 would be likely to occur only very rarely. The fact
of high correlation and of low standard error for the period covered could be
judged from the closeness with which the contours fit the individual
observations on the contour chart, Figure 21.9, in the same way that
closeness of observations to the regression line indicates high correlation
in the case of simple correlation.
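The residual standard deviation and the adjusted standard error quoted above can be reproduced from the last three columns of Table 21.7. The sketch below assumes that the adjustment for the number of constants takes the usual form $S = s\sqrt{n/(n-m)}$; that form is an assumption here, not a quotation of the book's equation, and the variable names are illustrative only. The 1928 observation is left out, as in the text.

```python
import math

# Table 21.7, years 1913-1927: yield adjusted for trend, and the yield
# estimated from the contours f(X2, X3), both in bushels.
adjusted_yield = [246, 287, 210, 237, 156, 222, 247, 192,
                  311, 199, 267, 322, 255, 293, 233]
estimated_yield = [248, 260, 229, 236, 155, 220, 248, 196,
                   323, 197, 278, 308, 262, 297, 226]

residuals = [a - e for a, e in zip(adjusted_yield, estimated_yield)]
n = len(residuals)
mean_residual = sum(residuals) / n

# Standard deviation of the residuals (divisor n, no adjustment).
s_z = math.sqrt(sum((r - mean_residual) ** 2 for r in residuals) / n)

# Assumed adjustment for m constants used in describing the surface.
m = 5
standard_error = s_z * math.sqrt(n / (n - m))

print(f"s_z = {s_z:.1f} bushels, adjusted standard error = {standard_error:.1f} bushels")
```

With only 15 observations and 5 assumed constants, the adjustment raises the error by more than a fifth, which is why the choice of m matters so much in small samples.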



Fig. 21.9. Yield of potatoes for years of specified rainfall before
August 1 and after August 1, and contours fitted directly to the data.
(Horizontal axis: rainfall, August 1 to September 15, $X_3$.)

In this case, the new residuals after adjusting for the joint function
(last column of Table 21.7) do not indicate that any further shift in the
net trend line is needed.

Determining Joint Functions in k Variables. Given a large enough
sample, and sufficiently high correlation, it should be possible to determine
joint functions in three or more independent variables. Such a relation
would be expressed by the equation

$$X_1 = f(X_2, X_3, X_4, \ldots, X_k) \qquad (21.5)$$

Such a complex relationship would be far too elaborate to represent on
any single diagram, though a solid diagram could be drawn for three
independent variables by "transparent" drawing (or the use of several
colors) which superimposed several different surfaces, for different values
of $X_4$, on a solid diagram such as Figure 21.6.

An approximation to a full complex joint function could be made by
using several subordinate joint functions in one regression equation.
Thus if corn yields were to be explained by five independent variables,
$X_2$ = rain in July, $X_3$ = temperature in July, $X_4$ = rain in August,
$X_5$ = temperature in August, and $X_6$ = time, the yield might be well
explained by a set of relations represented by the equation

$$X_1 = f_{23}(X_2, X_3) + f_{45}(X_4, X_5) + f_6(X_6) \qquad (21.6)$$

Such a relationship, involving two sets of joint functions and one single
net regression function, could be fitted by an extension of the methods
already discussed, either by solving for an appropriate algebraic equation,
or by fitting by successive approximations, using the method of graphic
approximations and contour fitting for each of the two joint functions.

The more complicated case indicated by the more general equation
(21.5) would involve even greater difficulties. Where an appropriate type
of equation has been deduced to fit a given joint function, it can be
expanded to represent as many more variables as the conditions of the
case require and within the limits of the availability of the data to
determine the necessary constants. The arithmetic labor becomes increasingly
heavy, and unless the sample is very large and the correlation quite high,
many of the constants will show values without statistical significance.
These can then be dropped, and the equations solved for the terms which
are statistically significant. For three independent variables, for example,
the square root equation used in the corn-fertilizer example would
become

$$X_1 = a + b_2X_2 + b_3X_3 + b_4X_4 + b_5\sqrt{X_2} + b_6\sqrt{X_3} + b_7\sqrt{X_4} + b_8\sqrt{X_2X_3} + b_9\sqrt{X_2X_4} + b_{10}\sqrt{X_3X_4} \qquad (21.7)$$

Obviously this ten-constant equation could be fitted effectively only
to a very large number of observations. Similar expansions can be made
for the various other types of joint functions described. For more than
three independent variables, they would become very cumbersome indeed.
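How quickly the square-root equation grows can be seen by listing its terms mechanically. The sketch below builds one row of regressors for any number of independent variables: a constant, each variable, each square root, and the square root of each pairwise product. The function name `square_root_terms` is hypothetical, and the input values are made up for illustration; fitting the constants would still require a least-squares solution, which is not shown.

```python
import math
from itertools import combinations


def square_root_terms(x: list[float]) -> list[float]:
    """Return the regressors of the square-root equation for one observation:
    [1, X_i ..., sqrt(X_i) ..., sqrt(X_i * X_j) for every pair i < j]."""
    terms = [1.0]                                     # the constant a
    terms += [float(v) for v in x]                    # linear terms
    terms += [math.sqrt(v) for v in x]                # square-root terms
    terms += [math.sqrt(a * b) for a, b in combinations(x, 2)]  # joint terms
    return terms


# Illustrative (made-up) values of three independent variables.
row = square_root_terms([1.0, 4.0, 9.0])
print(len(row), row)

# Number of constants to be determined as variables are added.
for k in (2, 3, 4, 5):
    print(k, "independent variables ->", len(square_root_terms([1.0] * k)), "constants")
```

Two variables give six constants and three give ten, as in the text; the count rises roughly with the square of the number of variables because of the pairwise terms.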



To determine the general function (21.5) graphically for even three
independent variables would require a very large number of observations,
since a threefold classification would be needed. If only 4 groups were
used for each variable, 64 subgroups would be possible. Not unless there
were sufficient observations so that, say, 3 to 5 might fall in each subgroup,
on the average, could such a relation be determined with any degree of
accuracy, unless the correlation was very high indeed. If the joint
correlation were perfect, one case to a subgroup would be sufficient to
indicate the nature of the function. Somewhat similar requirements would
apply if the surface were fitted algebraically to obtain significant values
of the many constants.
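The arithmetic behind this requirement is simple enough to tabulate. In the sketch below, `observations_needed` is a hypothetical helper that returns the number of subgroups and the total observations implied by a given average number of cases per subgroup.

```python
def observations_needed(groups_per_variable: int,
                        n_variables: int,
                        cases_per_subgroup: int) -> tuple[int, int]:
    """Number of subgroups in a full cross-classification, and the total
    observations needed for the stated average number per subgroup."""
    subgroups = groups_per_variable ** n_variables
    return subgroups, subgroups * cases_per_subgroup


# Three independent variables, 4 groups each, 3 to 5 cases per subgroup.
subgroups, low = observations_needed(4, 3, 3)
_, high = observations_needed(4, 3, 5)
print(f"{subgroups} subgroups, roughly {low} to {high} observations")
```

On these assumptions the threefold classification calls for something like 190 to 320 observations, and each further variable multiplies the figure by four.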

With three independent variables, successive smoothing in three
dimensions would be involved. The process might be simplified by
dividing the observations into several groups according to one variable,
determining the functional relation to the other two independent variables
separately for each group, and then smoothing the results for the different
groups together to determine the change in joint function with changes
in the first variable.

Figure 21.10 illustrates some results of this sort, for a four-dimensional
joint function. These results were obtained from an analysis of 190
observations of sales of individual lots of apples. The records were first
separated into those for each of the 5 sizes of apples, and the joint functional
relation of price to amount of insect injury and amount of scab determined
separately for each size. These results were then smoothed between
apples of different sizes, to make the "surface" of the imaginary
four-dimensional solid diagram show a gradual continuous change over
every dimension.*

The four-dimensional relationship

$$X_1 = f(X_2, X_3, X_4)$$

may be visualized by a composite diagram,† as illustrated in Figure
21.10. This shows that the presence of defects reduced the price of large
apples much more than the price of small ones, and that either defect
alone reduced the value of apples of any size materially, whereas both
defects together reduced the price only slightly more than one alone.

* This illustration is from an analysis supplied by Frederick V. Waugh. For a more
elaborate study of the same type, see John R. Raeburn, Joint correlation applied to the
quality and price of McIntosh apples, Cornell University Agricultural Experiment
Station Memoir 220, March, 1939.

† It is also possible to show the relations on one diagram by superimposing the four
surfaces one on top of the other, and drawing all four as if they were transparent sheets.
This can be done more elegantly by making a solid model, with the four surfaces
represented by plastic sheets, each of a different color.



The only limitation to the number of variables which could be
considered jointly is the number of observations available. Where it is
possible to determine a joint relation, that affords a very satisfactory
statement of the relationship, since the real relation is not obscured by
assumptions hidden in the regression equation used.



Fig. 21.10. Average price of apples of given sizes, for various combinations
of amount of insect injury and amount of scab.

Just as the accuracy of a regression curve is influenced by the distribution
of the observations along various portions of the curve and the
standard error of estimate, so the reliability of a joint regression surface
(such as that shown in Figure 21.10) would depend not only on the standard
error of estimate, but also on the number of cases falling within each
portion of the area. Where the joint regression surface is determined
mathematically, its reliability can be estimated by an extension of the
equations presented in Chapters 17 and 19. Methods of estimating the
standard errors of a surface determined graphically have not yet been
developed.



Summary

This chapter has developed means by which the relation between one
variable and two others operating jointly may be determined, either where
no other variables are concerned or where one or more additional
independent variables are taken into account. Methods are also discussed
for measuring the influence of three or more independent variables
operating jointly; but the increased number of observations necessary for
such determinations restricts the field of usefulness of this type of analysis.

REFERENCES 

Ezekiel, Mordecai, The determination of curvilinear regression "surfaces" in the
presence of other variables, Jour. Amer. Stat. Assoc., Vol. XXI, pp. 310-320,
September, 1926.

Waugh, Frederick V., The use of isotropic lines in determining regression surfaces,
Jour. Amer. Stat. Assoc., Vol. XXIV, p. 144, June, 1929.

Court, Andrew T., Measuring joint causation, Jour. Amer. Stat. Assoc., Vol. XXV,
No. 171, pp. 245-254, September, 1930.

Bruce, Donald, On possible modifications in the Ezekiel method of curvilinear multiple
correlation. Typewritten manuscript, filed in the Library, Agr. Mktg. Serv., U. S.
Dept. Agr.

Heady, Earl O., John T. Pesek, and William G. Brown, Crop response surfaces and
economic optima in fertilizer use, Iowa State College Agr. Expt. Sta. Res. Bul.
424, March, 1955.



CHAPTER 22 


Measuring the way a dependent
variable changes with changes in
a qualitative independent variable


It is sometimes necessary to determine the change in one variable
associated with changes in a qualitative independent factor, i.e., one
which varies in ways that cannot be measured quantitatively. Thus if
one is studying the effect of various factors affecting the values of individual
farms, one might wish to include the kind of road on which the farm was
located. Yet different kinds of roads, such as arterial highway, all-weather,
gravel, or dirt, cannot be stated in the same way that the measurements for
continuously variable factors can.

Measuring Simple Correlation with a Qualitative Variable. Where
a single qualitative factor is to be considered as an independent variable,
and a quantitative factor as the dependent, the regression relation may be
determined by grouping the observations according to the category of
the qualitative factor, and calculating the average value of the dependent
factor for each group.

The intensity of correlation between the quantitative dependent factor
and a qualitative independent factor is measured by the correlation ratio,
which corresponds to the correlation coefficient or index in that it measures
the proportion of the variation in the dependent factor explained by its
association with the independent factor. Using $n_0$ to represent the number
of cases in each successive group according to the independent factor,
$M_0$ to represent the mean of the dependent variable Y in each such
group, and $M_y$ to represent the mean of Y for all the cases together,
the correlation ratio $\eta_{yx}$ is defined

$$\eta_{yx} = \sqrt{\frac{\sum (n_0 M_0^2) - M_y \sum (n_0 M_0)}{n s_y^2}} \qquad (22.1)$$

The calculation of $\eta_{yx}$ may be illustrated using some of the data given
in Table 22.1. $X_5$ in this table shows the way eggs are packaged for sale,
in 3 types; and $X_1$ is the price of eggs per dozen. The first step is to
average the prices for each type of sale, with the following results:

Method of Sale, X5                     Number of    Average Price for That
                                         Cases      Group (cents per dozen)
A, sold without carton                     17               47.94
B, sold in carton, but unbranded           33               52.91
C, sold in carton with brand name          24               52.08

Apparently, eggs in cartons sold on the average for more than eggs not
in cartons, but in this case adding a brand name to the carton did not
appear to make much difference in price, and if anything, brought a lower
price.

The calculations are then performed as follows:

X5 Group       Number of     Sum of Y      M0         n0 M0^2
               Cases (n0)                             = (M0)(Sum of Y)
A                 17            815        47.941      39,071.92
B                 33          1,746        52.909      92,379.11
C                 24          1,250        52.083      65,103.75
Summations        74          3,811        51.500     196,554.78

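To calculate $\eta_{yx}$ we also need the value $n s_y^2$, which is calculated
separately to be 3,228.50.

Equation (22.1) then becomes

$$\eta_{yx} = \sqrt{\frac{\sum (n_0 M_0^2) - M_y \sum (n_0 M_0)}{n s_y^2}} = \sqrt{\frac{196{,}554.78 - (51.500)(3{,}811)}{3{,}228.50}} = \sqrt{\frac{288.28}{3{,}228.50}} = \sqrt{0.0893} = 0.30$$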


The correlation ratio, 0.30, indicates that there is some relation between
the kind of package and the price, but a rather low one, with the relation
explaining only 9 per cent at best of the variance in egg prices from store
to store. (If adjustment were made for n and m by the usual formula,
with m = the number of groups used, the relation would be still lower.)
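The correlation ratio can be recomputed directly from the group counts and price totals in the tabulation above. This sketch uses the group sums rather than the rounded group means, so the intermediate figures differ slightly from the printed 288.28; the value $n s_y^2$ = 3,228.50 is taken from the text, and the variable names are illustrative only.

```python
import math

# (number of cases n0, sum of prices in cents) for each method of sale.
groups = {
    "A": (17, 815.0),    # sold without carton
    "B": (33, 1746.0),   # sold in carton, unbranded
    "C": (24, 1250.0),   # sold in carton with brand name
}
n_s_y_squared = 3228.50  # n times the variance of price, from the text

n = sum(n0 for n0, _ in groups.values())
total = sum(sum_y for _, sum_y in groups.values())
grand_mean = total / n

# Sum of n0 * M0^2, written as (sum of Y)^2 / n0 to avoid rounding M0.
sum_n_m_squared = sum(sum_y ** 2 / n0 for n0, sum_y in groups.values())

between = sum_n_m_squared - grand_mean * total
eta_squared = between / n_s_y_squared
eta = math.sqrt(eta_squared)

print(f"grand mean = {grand_mean:.3f}, eta^2 = {eta_squared:.4f}, eta = {eta:.2f}")
```

The squared ratio is the share of the price variance associated with the grouping, which is the "9 per cent at best" mentioned in the text.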



Measuring Multiple Correlation with One or More Qualitative
Independent Variables, and Other Quantitative Variables. In the egg
case, other variables such as quality, weight, and color of the eggs may
also influence the price, and more or less obscure the true relation to the
method of packaging. Such relationships may be explored by an extension
of the method of multiple correlation. This involves first solving for the
relations to the quantitative variables, then calculating the unexplained
residuals, studying their relation to the qualitative variable or variables,
and then readjusting the relations found with the other independent
variables after allowing for this relation. While this could be done either
by using algebraic equations or by successive graphic approximations,
the method will be illustrated here using the method of successive
approximations for both qualitative and quantitative variables.
The data given in Table 22.1 are from a study of the relation of various
quality factors to the retail price of eggs.* The factors shown in the
table are $X_2$, an index of the interior quality of the eggs; $X_3$, the weight
of each dozen in ounces; $X_4$, the number of white eggs in each dozen;
$X_5$, the type of carton the eggs were sold in; and $X_1$, the price per dozen,
in cents. Net curvilinear regressions have been determined for the three
quantitative factors by the successive approximation method, and estimated
prices have been worked out by the regression equation

$$X_1'' = a + f_2(X_2) + f_3(X_3) + f_4(X_4)$$

The residuals, $z''$, obtained by subtracting these estimated prices from
the observed prices, $X_1$, are also shown in the table.
Determining the Net Influence of the New Variable. The first step
in determining the net regression of $X_1$ on $X_5$ is to group the residuals
from the previous curves, $z''$, according to the new factor $X_5$, and
determine the average for each group. This gives results as follows:

Value of X5                      Average of z''
A, no carton                         -2.5
B, carton                            +0.9
C, carton and brand name             +0.6
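The grouping step can be sketched with the $z''$ residuals of Table 22.1, sorted here by type of carton. The dictionary layout and variable names are illustrative only; the figures are the 74 residuals as read from the table.

```python
# Residuals z'' from Table 22.1, grouped by type of carton (X5).
residuals_by_carton = {
    "A": [-1.8, 1.7, -5.3, -12.2, -2.9, -14.3, -5.1, -1.8, -7.2,
          3.5, 3.9, -3.6, 2.0, 3.9, -5.0, -1.5, 3.2],
    "B": [3.4, 3.3, -4.8, 4.6, 1.9, 2.1, -0.9, 2.6, 2.5, -3.2, 3.1,
          1.9, 8.3, 0.9, 2.2, -3.1, -9.9, 1.2, 1.2, 4.0, -6.5, -1.0,
          2.8, 3.4, 7.8, 1.4, 4.8, 2.8, 0.8, -2.4, 0.0, -2.6, -2.1],
    "C": [-7.3, -8.4, -9.4, 8.2, -1.6, -5.5, -3.6, 2.4, 1.8, 6.6,
          6.9, 4.8, -0.3, 14.0, 5.6, 6.4, 2.8, -6.6, -3.4, 4.0,
          -3.1, 2.2, -4.2, 2.3],
}

# Average residual for each group: the first approximation to f5(X5).
average = {group: sum(values) / len(values)
           for group, values in residuals_by_carton.items()}

for group, value in average.items():
    print(f"{group}: n = {len(residuals_by_carton[group])}, average z'' = {value:+.1f}")

# Net premium over eggs sold in bulk, in cents per dozen.
print(f"B over A: {average['B'] - average['A']:.1f}")
print(f"C over A: {average['C'] - average['A']:.1f}")
```

The group averages serve as a first estimate of the net effect of carton type, to be revised if the other curves change on later approximations.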

These results show that, after making allowances for the size, color, and
quality, eggs in unmarked cartons sold 3.4 cents above those sold in
bulk, on the average, but those with branded cartons sold only 3.1 cents
above eggs in bulk. (This contrasts with an average difference of
practically 5 cents a dozen in the simple averages, before the effects of the

* Charles B. Howe, Some local market price characteristics which affect New Jersey
egg producers; factors influencing the retail price of eggs, N. J. Agr. Expt. Sta. Bul. 150.



Table 22.1

Data for Egg Problem with a Non-Quantitative Independent Variable

   Independent Variables       Dependent
                               Variable,
  X2     X3     X4    X5*         X1        z''      f(X5)      z'''
  21     23      4     C          35       -7.3      +0.6       -7.9
  35     24     12     C          45       -8.4      +0.6       -9.0
  26     23     12     B          55        3.4      +0.9        2.5
  27     24     12     B          55        3.3      +0.9        2.4
  31     22     12     A          50       -1.8      -2.5        0.7
  35     24     12     C          44       -9.4      +0.6      -10.0
  28     23     12     C          60        8.2      +0.6        7.6
  41     23     12     B          50       -4.8      +0.9       -5.7
  28     26      2     C          45       -1.6      +0.6       -2.2
  24     23     11     B          52        4.6      +0.9        3.7
  28     20     12     C          45       -5.5      +0.6       -6.1
  49     24     12     C          55       -3.6      +0.6       -4.2
  30     24     12     C          55        2.4      +0.6        1.8
  48     23     12     B          60        1.9      +0.9        1.0
  19     22      9     C          45        1.8      +0.6        1.2
  22     23      3     A          45        1.7      -2.5        4.2
  33     25     12     C          60        6.6      +0.6        6.0
  26     24     12     C          59        6.9      +0.6        6.3
  35     23     12     B          55        2.1      +0.9        1.2
  20     23     12     B          50       -0.9      +0.9       -1.8
  25     25     12     B          55        2.6      +0.9        1.7
  46     24     12     B          60        2.5      +0.9        1.6
  30     26      1     B          45       -3.2      +0.9       -4.1
  24     24     12     B          55        3.1      +0.9        2.2
  48     23     12     B          60        1.9      +0.9        1.0
  17     22     12     C          55        4.8      +0.6        4.2
  18     22     12     A          45       -5.3      -2.5       -2.8
  41     24     12     C          55       -0.3      +0.6       -0.9
  30     25     12     C          67       14.0      +0.6       13.4
  19     24      2     B          53        8.3      +0.9        7.4
  47     24      0     B          55        0.9      +0.9        0.0
  32     24     12     B          55        2.2      +0.9        1.3
  26     24     12     B          49       -3.1      +0.9       -4.0
  38     24     12     A          42      -12.2      -2.5       -9.7
  29     23     12     B          42       -9.9      +0.9      -10.8
  24     24      0     A          45       -2.9      -2.5       -0.4
  37     25     12     A          40      -14.3      -2.5      -11.8

* A designates "sold without carton," B "sold in carton but unbranded," and C "sold
in carton with brand name."


Table 22.1 (Continued)

   Independent Variables       Dependent
                               Variable,
  X2     X3     X4    X5*         X1        z''      f(X5)      z'''
  36     23     12     A          48       -5.1      -2.5       -2.6
  10     23      0     B          47        1.2      +0.9        0.3
  35     24     12     C          59        5.6      +0.6        5.0
  22     22     12     B          52        1.2      +0.9        0.3
  29     21     12     B          55        4.0      +0.9        3.1
  16     23      0     B          40       -6.5      +0.9       -7.4
   6     22      3     B          40       -1.0      +0.9       -1.9
  31     23     12     B          55        2.8      +0.9        1.9
  26     23     12     B          55        3.4      +0.9        2.5
  36     21     12     B          60        7.8      +0.9        6.9
  39     22     12     B          55        1.4      +0.9        0.5
  42     23     12     B          60        4.8      +0.9        3.9
  36     24     12     C          60        6.4      +0.6        5.8
  47     22     12     B          60        2.8      +0.9        1.9
  27     24     12     C          55        2.8      +0.6        2.2
  31     22     12     A          50       -1.8      -2.5        0.7
  26     22     11     A          40       -7.2      -2.5       -4.7
  45     23     12     A          60        3.5      -2.5        6.0
  18     25     12     C          45       -6.6      +0.6       -7.2
  35     24     12     C          50       -3.4      +0.6       -4.0
  21     23     12     C          55        4.0      +0.6        3.4
  44     23     12     A          60        3.9      -2.5        6.4
  48     24     12     A          55       -3.6      -2.5       -1.1
  33     24     12     A          55        2.0      -2.5        4.5
  47     24     12     C          55       -3.1      +0.6       -3.7
  16     22      5     A          45        3.9      -2.5        6.4
  32     25      0     B          50        0.8      +0.9       -0.1
  45     25     12     B          55       -2.4      +0.9       -3.3
  46     23     12     B          57        0.0      +0.9       -0.9
  32     24     12     C          55        2.2      +0.6        1.6
  16     23      1     C          41       -4.2      +0.6       -4.8
  30     25      1     C          50        2.3      +0.6        1.7
  24     22      0     A          42       -5.0      -2.5       -2.5
  44     24     11     B          50       -2.6      +0.9       -3.5
  25     22     12     B          49       -2.1      +0.9       -3.0
  16     23      0     A          45       -1.5      -2.5        1.0
  31     24      8     A          48        3.2      -2.5        5.7

* A designates "sold without carton," B "sold in carton but unbranded," and C "sold
in carton with brand name."



other variables were allowed for.) These results cannot be accepted as
the final effect of package on price without first raising the question
whether the curves previously determined to show the influence of the
other factors might be changed somewhat were the type of package taken
into account. Whether this will be true or not depends upon whether
there is any correlation between the new factor and the factors previously
considered, or whether they are quite independent of each other. This
can be determined by sorting the other factors according to the values
of $X_5$, and determining their averages for each group. The results are:


Value of X5              Averages of Other Independent Variables     Number of
                            X2          X3          X4                 Cases
A, no carton               30.6        23.1         8.6                 17
B, carton                  31.6        23.2         9.6                 33
C, carton and brand        29.9        23.8        10.2                 24


There does seem to be some correlation between $X_5$ and the other
variables. Apparently the eggs sold in unmarked cartons are, on the
average, of the best quality and of medium size; the eggs sold in cartons
under brand names are of larger size, but are not of such high quality,
on the average; whereas those sold in bulk average medium in quality
but low in size.* Accordingly, the curves previously determined for the
change in price with differences in size and in quality may have included
some portion of the effect really associated with cartons instead. Now that
at least an approximate measure has been obtained of the influence of
carton on price, the previous curves may be modified by taking this factor
also into account.

Taking Account of the Non-Quantitative Variable in Estimating
$X_1$ and $z$. In Table 22.1 in the column headed $f(X_5)$ the approximate
influence of differences in carton on price is entered, the averages found
in the tabulation of $z''$ by type of carton on page 380 being used. Since
these values would be added to the previous estimated values of $X_1$ to
obtain the new estimates, they are subtracted from the previous residuals
($z''$) to obtain the revised residuals. The last column shows these new
values, $z'''$. Before using these new values to see if any changes are
necessary in the other regression curves we may first determine how much
the standard error of estimate has been reduced by taking $X_5$ into account.
This could be determined directly by computing the standard deviation of
the new $z'''$ values, but a much shorter method is available, using the same
principle

* The exact correlation between $X_5$ and $X_2$, $X_3$, and $X_4$, respectively, can be
computed by use of the correlation ratio.



as in calculating the correlation ratio. By the use of this method, the
$s_{z'''}$ may be computed from the $s_{z''}$ by the formula

$$s_{z'''}^2 = \frac{n s_{z''}^2 - \left[\sum (n_0 M_0^2) - M_{z''} \sum (n_0 M_0)\right]}{n} \qquad (22.2)$$


The necessary computations are

X5       Average of z''     Number of         n0 M0        n0 (M0)^2
            (= M0)          Cases (= n0)
A            -2.5               17             -42.5         106.25
B             0.9               33              29.7          26.73
C             0.6               24              14.4           8.64
Sums                                             1.6         141.62

$$M_{z''} = \frac{1.6}{74} = 0.0216$$

$$s_{z'''}^2 = \frac{74(5.06)^2 - (141.62 - 0.04)}{74}$$

$$s_{z'''} = 4.87$$


Computing the standard error for estimates based on $X_5$ and the other
variables, we must recognize that the value of m has been increased by
three by the introduction of the new factor, so, whereas m was assumed
to equal 8 previously, it now equals 11. Adjusting the values of 5.06 for
$s_{z''}$ and 4.87 for $s_{z'''}$ by equation (15.6), we find $S_{1.f(2,3,4)} = 5.36$, and
$S_{1.f(2,3,4,5)} = 5.27$. Apparently the introduction of $X_5$ as a factor has had
as yet but slight effect on the accuracy with which egg prices might be
estimated.
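The short-cut computation and the subsequent adjustment can be followed step by step in code. In this sketch the group averages and counts are those of the tabulation of $z''$ by carton type, and 5.06 is the unadjusted standard deviation of $z''$ given in the text. The adjustment is assumed to take the form $S = s\sqrt{n/(n-m)}$; that is an assumption standing in for equation (15.6), which is not reproduced in this section, and the helper name `adjusted` is hypothetical.

```python
import math

n = 74
s_before = 5.06                                   # standard deviation of z''
groups = [(-2.5, 17), (0.9, 33), (0.6, 24)]       # (M0, n0) for A, B, C

sum_n_m = sum(m0 * n0 for m0, n0 in groups)           # sum of n0 * M0
sum_n_m_squared = sum(n0 * m0 ** 2 for m0, n0 in groups)
mean_z = sum_n_m / n                                  # mean of all z''

# Variation in z'' accounted for by the carton averages.
explained = sum_n_m_squared - mean_z * sum_n_m

# Short-cut formula: remove the explained part from n * s^2.
s_after = math.sqrt((n * s_before ** 2 - explained) / n)


def adjusted(s: float, n_obs: int, m_constants: int) -> float:
    """Assumed adjustment of a standard deviation for m constants."""
    return s * math.sqrt(n_obs / (n_obs - m_constants))


S_before = adjusted(s_before, n, 8)    # before carton type is allowed for
S_after = adjusted(s_after, n, 11)     # three more constants for X5

print(f"s after = {s_after:.2f}; adjusted: {S_before:.2f} -> {S_after:.2f}")
```

Most of the small gain in the unadjusted figure is used up by the three extra constants, which is why the adjusted standard error falls so little.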

Making Further Successive Approximation Corrections. It is still
possible that the regressions for the other factors might be modified
now that $X_5$ has been approximately allowed for. Consequently the
values of $z'''$ are classified according to the values of $X_2$, $X_3$, and $X_4$,
and the averages computed for each group. The averages in Table 22.2
suggest that the curve for $f_2(X_2)$ might be modified slightly, so as to rise
more steeply in the portion up to $X_2 = 36$ and less steeply thereafter.
Table 22.3 does not indicate any consistent relation between $X_3$ and $z'''$,
so no further change in $f_3(X_3)$ is indicated. Table 22.4 indicates that the
curve for $f_4(X_4)$ might also be altered slightly, so as to have a somewhat
steeper slope.

If these curves were modified as suggested, a new estimated value



Table 22.2

Average Values of z''' for Corresponding X2 Values

X2 Values     Number of Cases     Average of X2     Average of z'''
 0-14                2                 8.0               -0.9
15-19                9                17.2               -0.2
20-29               23                25.1               -0.1
30-39               24                33.5               +0.3
40-49               16                45.5               -0.1


Table 22.3

Average Values of z''' for Corresponding X3 Values

X3 Values     Number of Cases     Average of z'''
  20                 1                -6.1
  21                 2                 5.0
  22                13                 0.1
  23                23                 0.2
  24                25                -0.1
  25                 8                 0
  26                 2                -3.2


Table 22.4

Average Values of z''' for Corresponding X4 Values

X4 Values     Number of Cases     Average of X4     Average of z'''
  0                  7                 0                 -1.3
 1-2                 5                 1.4               -0.4
 3-5                 4                 3.8               +0.2
 8-11                5                10.0               -0.5
 12                 53                12.0               -0.2


38 $ Special Regression Methods 

of $X_1$ might then be worked out, using these new curves and the previous
curve for $f_3(X_3)$, and using the values for $f_5(X_5)$ already entered in
Table 22.1. The new $z$'s based on these new estimates might then be
classified with respect to $X_5$, to determine if any change need be made
in the values for $f_5(X_5)$ worked out on page 380. If any material change
were found necessary in the values for $X_5$, the residuals might be
corrected accordingly, and then averaged with respect to $X_2$, $X_3$, and
$X_4$, to see if any further changes would be needed in their values. This
process of successive approximation should be continued until no further
significant change was indicated in any of the curves, or until the
standard error of estimate showed no further reduction.
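
The classification of residuals repeated at each round of this process
can be sketched in code. The Python fragment below is only an
illustration under assumed inputs: the function name, the class limits,
and the sample values are hypothetical and are not the egg-price records.

```python
from collections import defaultdict


def average_residuals_by_class(values, residuals, class_limits):
    """Group residuals by class interval of one independent variable.

    Returns {(low, high): (number of cases, mean value, mean residual)}.
    """
    groups = defaultdict(list)
    for x, z in zip(values, residuals):
        for low, high in class_limits:
            if low <= x <= high:
                groups[(low, high)].append((x, z))
                break
    summary = {}
    for limits, pairs in groups.items():
        n = len(pairs)
        mean_x = sum(x for x, _ in pairs) / n
        mean_z = sum(z for _, z in pairs) / n
        summary[limits] = (n, mean_x, mean_z)
    return summary


# Hypothetical values of one factor and the residuals left after the
# current approximation curves have been applied.
x_values = [4, 10, 16, 18, 22, 27, 33, 38, 44]
z_values = [-1.0, -0.8, -0.3, -0.1, 0.1, -0.3, 0.4, 0.2, -0.1]
limits = [(0, 14), (15, 19), (20, 29), (30, 39), (40, 49)]

summary = average_residuals_by_class(x_values, z_values, limits)
for key in sorted(summary):
    print(key, summary[key])
```

A consistent run of positive or negative averages along the classes
would suggest reshaping the curve for that factor; scattered signs would
not.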

It does not seem worth while, in this problem, to carry out the additional
steps just outlined. In a problem where the non-quantitative factor is
an important one, however, and where it is significantly correlated with
the other independent variables, the determination of the net function
for that factor should be carried through a sufficient number of
approximations to measure the final net effect of each factor as
accurately as possible.

Taking the preliminary results shown in Table 22.2 as the final measure
of the influence of type of container on price, it appears that eggs sold
in an unmarked carton brought, on the average, 3.4 cents more per dozen
than eggs of the same quality, size, and color sold in bulk, and 0.3 cent
more than eggs sold in a carton with a brand name. (This last result might
reflect the experience of consumers with branded eggs of poor quality,
as indicated in the tabulation on page 383, which might tend to make
them sell at a discount even when they were of equal quality.) The
improvement in closeness of fit may be measured by the slight reduction
in the adjusted standard error of estimate, or by the increase in the
index of multiple correlation. Computing the adjusted indexes of multiple
correlation corresponding to the standard errors of estimate before and
after the type of carton is allowed for, by equation (17.9),
$\bar{I}_{1.234} = 0.59$ and $\bar{I}_{1.2345} = 0.62$. The corresponding
indexes of determination, 35 and 38 per cent, indicate that taking into
consideration the differences in the carton has increased the proportion
of the variance in egg prices which can be explained by 3 per cent of the
original variance, after due allowance is made for the additional
constants introduced.
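
The step from the indexes of correlation to the indexes of determination
is a simple squaring, shown here as a small Python check. The variable
names are illustrative only; the two index values are those quoted in
the text.

```python
# Adjusted indexes of multiple correlation quoted in the text.
index_before = 0.59  # type of carton not allowed for
index_after = 0.62   # type of carton allowed for

# The index of determination is the square of the index of correlation,
# expressed here as a whole percentage.
pct_before = round(100 * index_before ** 2)
pct_after = round(100 * index_after ** 2)
gain_pct = pct_after - pct_before

print(pct_before, pct_after, gain_pct)
```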

When the stricter criterion of “significant improvement” based on
variance analysis is applied (see Chapter 21, pages 366 to 368, and Chapter
23, pages 403 to 411), however, it appears that this increase could have
been due to the chance characteristics of this particular sample. The
same percentage increase in explained variance based on 200 observations
rather than 74 would have been significant in the sense that it would be
expected to occur by chance in only 1 sample out of 28 of the larger size.




The first approximation to the regression on non-quantitative factors
can be made directly from the residuals from the linear multiple regression
equation, instead of waiting until after approximate regression curves
are determined for the other factors. In case a non-quantitative factor
is a very important one, it may be roughly included in the net linear
regressions by designating successive groups by a numerical code which
approximates the expected influence of the variable. Then if the true
influence is of a different order from the expected influence, that fact
will show up when the first approximation curves are worked out. (For
the non-quantitative factor the averages of residuals must be interpreted
as discrete points for each class, however, rather than as a continuous
function.) Thus it might have been tentatively assumed that eggs in
branded cartons would sell above eggs in unbranded cartons, and both
would sell well above eggs in bulk. The bulk eggs could then have been
designated by 1; the unbranded cartons by 3; and branded cartons by 4.
The net linear regression would have been positive; but the residuals
would have revealed that eggs in branded cartons really averaged lower
in price (other factors equal) than in unbranded cartons.

This technique is also useful in time-series analysis to determine the
net seasonal movement (as in a price series), while simultaneously allowing
for the influence of other variables. For monthly data, this would involve
12 groups, one for each month. Smoothing the values through the year
in a continuous curve would reduce somewhat the number of degrees of
freedom used up.

Where there are two or more qualitative factors involved, the methods
presented here may be extended to deal with all of them simultaneously
by successive approximations.

Summary 

Where an independent factor is not a continuous variable, but may be
classified into two or more groups, the regression of a dependent factor
on it may be determined with respect to each group, and measures of
regression and simple correlation can be calculated. Where other
independent factors are also involved, net relations with the qualitative
factor or factors may also be determined, while holding other factors
constant by the usual multiple correlation process. Standard errors and
indexes of multiple correlation may be worked out to include the effects
of non-quantitative independent factors as well as of continuously
variable factors.



CHAPTER 23 


Cross-classification 
and the analysis of variance 


Introduction. In earlier editions, cross-classification and averaging
was presented as a method of analysis that stopped short of formal
multiple regression but embodied the basic idea of net regression lines or
curves. In the present edition the concept of net regression has already
been presented on a mathematical basis; the “drift lines” of the short-cut
graphic method have also been presented as approximations to the final
net regression curves. When large numbers of observations are available,
the idea of studying the relationship between $X_1$ and $X_2$ within each
of a number of subclasses of $X_3$ is intuitively obvious. Most data of the
census type are presented in the form of cross-tabulations, so that
frequencies and average levels of a “dependent” variable can be obtained
for different combinations of values (usually classes or ranges) of two or
three “independent” variables. Average incomes of workers
cross-classified by age and years of schooling would be an example. These
are all quantitative factors. In addition there may be non-quantitative
subdivisions: the first three variables may be available for male and
female workers separately and for each state or region.

Sometimes simple methods are overlooked, so the first section of this
chapter will present an example of analysis by means of cross-classification
and averaging. The remainder will present some basic principles of the
analysis of variance and discuss its relationship to regression analysis.
While analysis of variance has been developed primarily in connection
with agricultural and biological experiments, its use has spread over the
whole range of experimental sciences including, in recent years, some
applications to the social sciences. For example, the formulas used in
Chapter 22 to estimate the effects of a qualitative independent variable
are based upon variance analysis concepts.
Analysis by Cross-Classification and Averaging 

Analysis by averages where there are two independent variables involves
classifying the records first by one variable, then breaking each of the
resulting groups into several smaller groups according to the values of
the second variable. If a third independent variable were to be considered,
these groups would be broken up into still smaller groups, according to

Table 23.1

Cross-Classification of Reports According to Size of Farm and Size
of Dairy Herd

                                   Size of Dairy Herd
                      Under 6 Cows           6 to 11 Cows         12 Cows and Over
Size of Farm       Acres  Cows  Income    Acres  Cows  Income    Acres  Cows  Income
                   Number Number Dollars  Number Number Dollars  Number Number Dollars

50-99 acres         ...    ...    ...       80     6     610       60    18     960
                    ...    ...    ...      ...   ...     ...       70    17   1,020
                    ...    ...    ...      ...   ...     ...       90    12     800
                    ...    ...    ...      ...   ...     ...       80    15     800
  Total             ...    ...    ...      ...   ...     ...      300    62   3,580
  Average           ...    ...    ...       80     6     610       75    15.5   895

100-149 acres       120      1    590      100     9     900      110    12     880
                    ...    ...    ...      110     6     740      120    15   1,080
                    ...    ...    ...      ...   ...     ...      110    16   1,130
  Total             ...    ...    ...      210    15   1,640      340    43   3,090
  Average           120      1    590      105     7.5   820      113    14.3 1,030

150-199 acres       160      0    700      170     6     820      180    14   1,260
                    ...    ...    ...      160     7     860      160    12     980
  Total             ...    ...    ...      330    13   1,680      340    26   2,240
  Average           160      0    700      165     6.5   840      170    13   1,120

200 acres and       220      0    830      240     7     960      ...   ...     ...
  over              230      2    760      ...   ...     ...      ...   ...     ...
                    220      2    760      ...   ...     ...      ...   ...     ...
  Total             670      4  2,350      240     7     960      ...   ...     ...
  Average           223      1.3  783      240     7     960      ...   ...     ...

the values of the third variable. Then the values of the dependent variable,
as well as each of the independent variables, would be averaged for each
subgroup. This process is known as subclassification or
cross-classification.

Cross-Classification for Three Variables. In the problem presented
in Chapter 10 there were two independent variables, number of cows and
number of acres. The records would therefore need to be classified into
groups both according to the number of cows and the number of acres




[Figure: average income plotted against average number of cows, with a
separate line for each size-of-farm group.]

Fig. 23.1. Difference in average income with difference
in number of cows, for farms grouped by size of farm.


on each farm. Since there is such a small number of records the groups
should not be made too small. Let us take three groups for cows, and
four groups for the size of farm, or twelve possible groups in all. The
records are classified into twelve groups, and totals and averages computed
for each, as in Table 23.1.
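
The sorting and averaging just described can be sketched in Python. The
records below are the twenty farms as read from Table 23.1; the function
names and class labels are illustrative choices, not part of the original
method.

```python
from collections import defaultdict

# (acres, cows, income in dollars) for each farm, as read from Table 23.1.
farms = [
    (80, 6, 610), (60, 18, 960), (70, 17, 1020), (90, 12, 800),
    (80, 15, 800), (120, 1, 590), (100, 9, 900), (110, 6, 740),
    (110, 12, 880), (120, 15, 1080), (110, 16, 1130), (160, 0, 700),
    (170, 6, 820), (160, 7, 860), (180, 14, 1260), (160, 12, 980),
    (220, 0, 830), (230, 2, 760), (220, 2, 760), (240, 7, 960),
]


def size_class(acres):
    """Assign a farm to one of the four size-of-farm groups."""
    if acres < 100:
        return "50-99"
    if acres < 150:
        return "100-149"
    if acres < 200:
        return "150-199"
    return "200 and over"


def herd_class(cows):
    """Assign a farm to one of the three size-of-herd groups."""
    if cows < 6:
        return "under 6"
    if cows < 12:
        return "6-11"
    return "12 and over"


# Sort the incomes into the cells of the cross-classification.
cells = defaultdict(list)
for acres, cows, income in farms:
    cells[(size_class(acres), herd_class(cows))].append(income)

# Number of farms and average income in each occupied cell.
cell_means = {cell: (len(v), sum(v) / len(v)) for cell, v in cells.items()}
for cell, (count, mean_income) in sorted(cell_means.items()):
    print(cell, count, round(mean_income))
```

Only ten of the twelve possible cells are occupied, which already hints
at how thinly a small sample is spread by this method.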

None of these groups has a sufficient number of farms represented to
make the averages particularly significant, yet even so a certain regularity
in the averages can be observed. In each column the average income
increases as the size of farm increases, though there is but little difference
in the average number of cows from group to group; similarly across
each line of averages the income increases as the number of cows increases,
though there is but little difference in the average size of farm from group
to group. The average incomes from Table 23.1 are charted in Figures
23.1 and 23.2, first for differences in the number of cows with farms of
similar sizes, and then for differences in the number of acres, with farms
of similar numbers of cows.





Both figures show the tendency for income to increase with an increase
in the independent variable, when the effect of the other variable is held
fairly constant by the grouping process. In Figure 23.1 the lines show about
the same general slope for each of the four groups, though there are some
irregularities. Figure 23.2 similarly shows about the same general change
in income with a given change in the size of the farm, no matter what is
the number of cows; but here the irregularities from group to group are
more striking.



Fig. 23.2. Difference in average income with difference in
number of acres, for farms grouped by numbers of cows.


Individual differences as large as those shown might be due simply
to random differences in sampling and therefore have no real meaning.
But when a series of small samples shows consistent trends, as these do,
they are more meaningful than each taken separately.

Although the subgroup averages show the general effect of changes
in one variable, such as cows, upon income, with the effect of the other
variable, acres, removed, they cannot be considered to show the specific
effect of specific differences. For example, much more evidence would
be needed to prove that, between 75 and 100 acres, a change of 1 acre has
much greater effect upon income on farms with 6 to 11 cows than on
farms with 12 cows or more, even though the lines in Figure 23.2 appear
to indicate this. All that is really proved is that on farms of both numbers
of cows there is a tendency for income to increase with an increase in
the number of acres.

The averages in Table 23.1 may be summarized for publication in a
form similar to Table 23.2. The number of cases represented in each
average is included to prevent the reader from placing an undue amount
of confidence in an average based on a small number of observations.
In addition, the reader should be given some indication of the approximate
standard error of each average.

The very small number of cases included in each of the groups is
strikingly brought out in Table 23.2. Even if there were five times as
many farms to deal with (100 in all), if they were distributed in the same
manner, the largest group would have only 20 cases, and all the rest would
have 15 or less, which, under ordinary conditions, would be hardly enough
for really significant averages. This subsorting technique is most useful
with census-type material in which many of the individual cells contain
100 or more observations.


Table 23.2

Difference in Average Income for Farms of Different Sizes and
With Different Sizes of Dairy Herd

                      Under 6 Cows           6 to 11 Cows          12 Cows or Over
                        in Herd                in Herd                 in Herd
Size of Farm        Size of    Average     Size of    Average     Size of    Average
                     Group     Income       Group     Income       Group     Income
                    Number                 Number                 Number
                   of farms    Dollars    of farms    Dollars    of farms    Dollars

50 to 99 acres        ...        ...          1         610          4         895
100 to 149 acres       1         590          2         820          3       1,030
150 to 199 acres       1         700          2         840          2       1,120
200 to 249 acres       3         783          1         960         ...        ...




Average Differences between Matched Subgroups. After the
observations have been grouped and averaged as shown in Table 23.2,
average differences in the dependent variable (as here, dollars of income),
with given differences in each independent variable, can be roughly
determined while holding constant the other independent variable or
variables. This involves determining the average differences between
the averages for the dependent variable for matched groups. The
computations are shown in Tables 23.3 and 23.4.

From these results it appears that increasing the number of cows from
under 6 to between 6 and 11, without changing the size of farm, was
accompanied by an average increase of $182. Increasing the cows further
to 12 cows or over was accompanied by a further increase of income of
$258. Similarly, increasing the size of farm from under 100 acres to
100-149 acres, without changing the number of cows, was accompanied






Table 23.3

Change in Average Income Between Groups Matched for Size of Farm

                     A          B           C            D            E
Size of Farm       Under       6 to      Increase     12 Cows      Increase
                  6 Cows     11 Cows     (B - A)      or Over      (D - B)
   Acres          Dollars    Dollars     Dollars      Dollars      Dollars

   50- 99           ...        610         ...          895          285
  100-149           590        820         230        1,030          210
  150-199           700        840         140        1,120          280
  200-249           783        960         177          ...          ...

Average change
  with cows         ...        ...         182          ...          258


Table 23.4

Change in Average Income Between Groups Matched for
Number of Cows

                  A         B          C          D          E          F          G
Number          50-99    100-149    Increase   150-199    Increase   200-249    Increase
of Cows         Acres     Acres     (B - A)     Acres     (D - B)     Acres     (F - D)
               Dollars   Dollars    Dollars    Dollars    Dollars    Dollars    Dollars

Under 6          ...       590        ...        700        110        783         83
6 to 11          610       820        210        840         20        960        120
12 or over       895     1,030        135      1,120         90        ...        ...

Average change
  with acres     ...       ...        173        ...         73        ...        102


by an increase of $173 in income. A further increase to 150-199 acres
was accompanied by a further average increase of $73 in income, and to
200-249 acres, by $102 more income. (In this discussion “increase”
in size or cows has been used to designate differences between results for
farms of different sizes or with different numbers of cows.) These rough
measurements of differences in the dependent variable with differences
in one independent variable, while holding a second independent variable
constant by subsorting, may be compared with results obtained by the more
exact methods set forth in earlier chapters.¹

¹ In computing Tables 23.3 and 23.4, no attention was paid to weighting the results
according to the number of cases falling in each group, or to the sampling reliability of
each average. For a discussion of the first of these points see page 395.



This same method may be applied to get the average difference between
matched subgroups, where two or more other independent variables are
held constant by the grouping.
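
The arithmetic behind Tables 23.3 and 23.4 can be checked with a short
Python sketch. The cell averages are those of Table 23.2; the dictionary
layout and function names are assumptions made for illustration, and the
averaging is unweighted, as in the tables themselves.

```python
# Cell average incomes (dollars) from Table 23.2; None marks an empty cell.
avg_income = {
    "50-99":   {"under 6": None, "6-11": 610, "12+": 895},
    "100-149": {"under 6": 590, "6-11": 820, "12+": 1030},
    "150-199": {"under 6": 700, "6-11": 840, "12+": 1120},
    "200-249": {"under 6": 783, "6-11": 960, "12+": None},
}


def mean_matched_difference(table, low, high):
    """Unweighted mean of (high - low) over rows where both cells exist."""
    diffs = [row[high] - row[low] for row in table.values()
             if row[low] is not None and row[high] is not None]
    return sum(diffs) / len(diffs)


def transpose(table):
    """Swap the two classifications so herd classes become the rows."""
    flipped = {}
    for size, row in table.items():
        for herd, value in row.items():
            flipped.setdefault(herd, {})[size] = value
    return flipped


# Differences with cows, size of farm held constant (Table 23.3).
cows_steps = [
    mean_matched_difference(avg_income, "under 6", "6-11"),
    mean_matched_difference(avg_income, "6-11", "12+"),
]

# Differences with acres, number of cows held constant (Table 23.4).
by_herd = transpose(avg_income)
acre_steps = [
    mean_matched_difference(by_herd, "50-99", "100-149"),
    mean_matched_difference(by_herd, "100-149", "150-199"),
    mean_matched_difference(by_herd, "150-199", "200-249"),
]

print(cows_steps)   # close to 182 and 258
print(acre_steps)   # close to 173, 73, and 102
```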

Limitation of Cross-Classification for Many Variables. This small
example illustrates one fundamental difficulty with the method of
subclassification and averaging: the large number of cases required for
conclusive results. Though there are only two independent variables
involved, and the records are classified into only three groups one way
and four the other, apparently 100 cases or more would be required for
really significant results. If it had been desired to subclassify the records
according to two more additional variables (say number of men employed
and number of hogs kept), that would have greatly increased the number
of records necessary. If each of the groups already shown had been
further subdivided into 3 groups for men and 4 groups for hogs, that
would have increased the number of possible groups to 108. Where
over 100 records would have been needed in the first case to give results
at all reliable, probably 500 or more would be needed with this further
classification.

The method of subclassification and averaging has further shortcomings.
It provides no measure of how important the relation shown
is as a cause of variation in the factor being studied or of how closely
that factor may be estimated from the others on the basis of the relations
shown. The method of subclassification and averaging thus does not
determine the relationships where many variables are involved so
satisfactorily as does multiple regression.

Fitting Regressions to Group Means. Sometimes, particularly with
census data or with sample surveys based on a thousand or more cases,
we may have access to the cell averages and the number of observations
in each cell but not to the original observations or to estimates of standard
deviations or variances within cells. If the dependent variable and one
or more of the classifying variables are quantitative, an obvious way of
summarizing the data is to fit simple or multiple regressions to the cell
means. (We use actual cell means of the classifying variables if they are
available, otherwise midpoints of class intervals.)

If the cell frequencies are fairly large, the cell means will show much
less variability than would the individual observations. Correlation
coefficients between two series of cell means will sometimes be very high.
Also, we can calculate standard errors of regression coefficients, treating
each mean as a single observation. But the sampling significance of these
measures is by no means clear unless we have some idea as to the variance
of the original observations.

The effects of grouping can be illustrated roughly as follows. Suppose
we have a regression based on 900 original observations, and find that
$S_{y \cdot x} = 100$ and $r^2 = 0.20$. The variance per observation about
the line would be $S^2_{y \cdot x} = 10{,}000$ and the explained variance
2,500. Now, suppose the observations are grouped into 9 class intervals of
$X$ with 100 observations in each interval. The “standard error of estimate”
about a regression line fitted to the 9 group means will now be on the order
of 10 rather than 100, while the standard deviations of $X$ and of $Y$ will
be about the same as before. Thus, the unexplained variance (per group
mean) is about 100; the explained variance remains about 2,500, and $r^2$
between the group means is about 0.96. It is important to note that 0.96 is
not a measure of the “success” of the analysis in the same sense as a similar
value obtained in relating original observations. If one has any knowledge
at all concerning the order of magnitude of the standard deviations of
original observations in a particular study or even in other studies of the
same variables (such as family income and consumption of particular
foods), one can convert the standard error of a regression coefficient
based on group means to a rough estimate of the standard error that
might have been obtained from ungrouped data. One or two such
calculations would provide a rough basis for appraising the reliability
of other regression coefficients derived from the same survey or census.

Certain other refinements may be needed to obtain results comparable
to those from ungrouped data. If the variability of the dependent variable
within each cell is assumed constant but the numbers of observations in
different cells vary a great deal, the situation will be improved if the
squares and cross-products involving each cell mean are weighted by the
number of observations in the cell. Further, if we knew that the variance
of the dependent variable differed from cell to cell according to some
function of the dependent variable, we might (1) attach less weight to the
classes having greater than average variability; or (2) transform the
original means into logarithms or other functions that should make the
residual variance roughly the same about each section of the regression
line.
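
The rough illustration of the effect of grouping given above can be
traced step by step. The Python lines below simply restate that
arithmetic; the variable names are illustrative, and the assumption that
averaging 100 observations divides the unexplained variance by 100 is the
one implied by the text.

```python
# Figures assumed in the illustration: 900 observations, a standard
# error of estimate of 100, and r squared of 0.20 for ungrouped data.
s_yx = 100.0
r_squared = 0.20
per_group = 100  # observations averaged into each of the 9 group means

unexplained = s_yx ** 2                    # variance about the line
total = unexplained / (1.0 - r_squared)    # total variance
explained = total - unexplained            # explained variance

# Averaging 100 observations cuts the variance about the line by 100,
# while the explained variance is left about the same.
unexplained_means = unexplained / per_group
se_means = unexplained_means ** 0.5
r_squared_means = explained / (explained + unexplained_means)

print(round(explained), round(se_means), round(r_squared_means, 2))
```

The high value for the group means reflects the averaging, not a closer
underlying relationship, which is the caution the text draws.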


The Analysis of Variance 

In earlier chapters we have distinguished between the “correlation
model” and the “regression model” with particular reference to the
different interpretations of the correlation coefficient in the two cases.
We have noted that regression coefficients and standard errors of estimate,
however, have the same interpretation in both models.

The correlation model was originally applied to non-experimental
data: random samples from a universe in which all variables were
normally distributed. The regression model grew out of applications in
which the values of the independent variables could be selected as part
of an experiment. Particularly in connection with agricultural and
biological experiments, there developed an extensive theory of
experimental design. Part of this theory related to situations in which the
independent factors were qualitatively rather than quantitatively different.
For example, does one variety of wheat have a significantly higher yield
than another? Is one insecticide significantly more effective than another?
Regression analysis is not adapted to such cases. However, out of the
problem of comparing the effects of qualitatively different “independent”
factors grew a very powerful tool called “analysis of variance.” Moreover,
the theory underlying analysis of variance is sufficiently general to
include regression analysis as a special case. In this section we shall
consider only certain aspects of variance analysis that are closely related
to regression problems.

Basic Principles of Variance Analysis. The sum of squares of a set
of observations about their mean can be represented as the sum of two
independent sums of squares: specifically, in simple linear regression
analysis, a sum of squares of deviations of regression values from their
mean and a sum of squares of deviations about the regression line. We
have also noted that the first sum of squares has a single degree of freedom,
and the second has $n - 2$ degrees of freedom, $n$ being the total number
of observations to which the regression line is fitted.

In discussing tests of significance, we introduced the $t$ ratio, or ratio
of a coefficient to its standard error. For the simple regression coefficient,
this ratio can be written as follows (from equations (5.5) and (17.1)):

$$
t = \frac{b}{s_b}
  = \frac{\sum xy}{\sum x^2} \cdot \frac{\sqrt{\sum x^2}}{S_{y \cdot x}}
  = \frac{\sum xy}{S_{y \cdot x} \sqrt{\sum x^2}}
\tag{23.1}
$$

In this form, the $t$ ratio is used to estimate the probability that an
observed value of $b$ might have been obtained by chance in random sampling
from a population in which the true regression coefficient was zero.
A closely related measure is basic to the analysis of variance. “A
variance” is equal to a sum of squares divided by the appropriate number
of degrees of freedom. In simple regression, the sum $\sum y^2$ can be
divided into two parts or components, $b^2 \sum x^2$ and $\sum z^2$; that is,

$$
\sum y^2 = b^2 \sum x^2 + \sum z^2
$$

The first component has a single degree of freedom and the second has
$n - 2$ degrees of freedom, so the corresponding variances are
$b^2 \sum x^2$ and $\sum z^2 / (n - 2)$, or $S^2_{y \cdot x}$. It is
intuitively plausible that a measure of the significance of the regression
relationship between $y$ and $x$ can be based upon the ratio of these two
variances:

$$
F = \frac{b^2 \sum x^2}{S^2_{y \cdot x}}
$$


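Substituting $b = \sum xy / \sum x^2$ in this expression we obtain

$$
F = \frac{(\sum xy)^2 \sum x^2}{S^2_{y \cdot x} (\sum x^2)^2}
  = \frac{(\sum xy)^2}{S^2_{y \cdot x} (\sum x^2)}
\tag{23.2}
$$

It will be noted at once that $F$ is equal to $t^2$, or $t = \sqrt{F}$.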
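
The identity between $F$ and $t^2$ in simple regression can be verified
numerically. The following Python sketch computes both ratios from the
sums of squares and products defined above; the function name and the
five observations are hypothetical, chosen only for illustration.

```python
import math


def simple_regression_t_and_f(xs, ys):
    """Return (t, F) for the slope of a simple linear regression."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squares and products of deviations from the means.
    sum_x2 = sum((x - mean_x) ** 2 for x in xs)
    sum_y2 = sum((y - mean_y) ** 2 for y in ys)
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    b = sum_xy / sum_x2
    explained = b ** 2 * sum_x2              # 1 degree of freedom
    residual_var = (sum_y2 - explained) / (n - 2)

    t = sum_xy / (math.sqrt(residual_var) * math.sqrt(sum_x2))  # eq. (23.1)
    f = explained / residual_var                                # eq. (23.2)
    return t, f


# Hypothetical observations.
t_ratio, f_ratio = simple_regression_t_and_f(
    [1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1]
)
print(t_ratio ** 2, f_ratio)  # the two values agree up to rounding error
```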

The equality $t = \sqrt{F}$ holds only when the total sum of squares is
divided into two, and only two, parts. This occurs in simple linear
regression analysis and in testing the significance of the difference between
two means, two standard deviations, and so on. But the $F$-ratio is also
applicable to situations in which a total sum of squares is divided into
three or more components. The question asked in each case is as follows:
Could the variances under comparison have been obtained by random
sampling from the same population? This question is answered, of
course, on the basis of probability rather than certainty. Values of $F$
covering a wide range of paired values of degrees of freedom have been
tabulated by Snedecor and are reproduced here with the kind permission
of Dr. Snedecor and the Iowa State College Press.

Applications of Variance Analysis: (1) Difference between the
Effects of Two “Treatments.” The following data are taken from the
study used earlier by Heady, Pesek, and Brown:² Twenty-eight plots of
ground are selected at random in a field and are planted to corn of a
single variety. Eighteen of the plots receive no fertilizer, and 10 receive
nitrogen fertilizer at the rate of 40 pounds N per acre. The yield of corn
obtained on each plot is carefully measured. Our problem is to discover
whether the application of fertilizer at the specified rate has a significant
effect upon corn yields.

² Earl O. Heady, John T. Pesek, and William G. Brown, Crop response surfaces and
economic optima in fertilizer use, Iowa State College Agricultural Experiment Station,
Research Bulletin 424, March 1955, p. 330.



Table 23.5

5 Per Cent and 1 Per Cent Points for the F Distribution
(5 per cent in light face, 1 per cent in bold face)

[The tabulated values are not legible in this copy.]


The values resulting from the experiment are tabulated below, the
symbol $Y$ being used to represent an actual yield and $y$ a deviation from
the mean yield for a specified group.

                          Treatment 1      Treatment 2            Both
Item                      No Fertilizer    Nitrogen (40 pounds    Treatments,
                                           per acre)              Total

Observations ($n$)              18               10                   28
Mean yield ($M$)                26.65            60.66                38.7964
$\sum Y^2$                  14,794.71        43,982.22            58,776.93
$nM^2$                      12,784.00        36,796.36            42,144.56
$\sum y^2$                   2,010.71         7,185.86            16,632.37
Degrees of freedom              17                9                   27

The analysis of variance for these data is as follows:

                                   Degrees of      Sum of         Mean
Source of Variation                 Freedom        Squares        Square

Total                                  27         16,632.37
Difference between treatments           1          7,435.80      7,435.80
Variation within treatments            26          9,196.57        353.71


The total sum of squares about the mean of the 28 observations is
16,632.37. The sums of squares about the individual means are 2,010.71
and 7,185.86, totalling 9,196.57. The difference of 7,435.80 (equals
16,632.37 - 9,196.57) is due to the departures of the two group means from
the mean of the entire set of 28 observations. According to the logic of
the experiment, this term is an estimate of the effect of the difference in
treatment (nitrogen versus no nitrogen) accorded to the two groups
of plots. The variation within each group is presumably due to a large
number of factors of essentially random incidence independent of the
fertilizer treatment.

As It happens, the sum of squares within groups exceeds that ansing 
Cwta tbA w. waw* sb/sw*, 

that the effect of the difference m treatments per degree of freedom is 21 02 
times as large as the corresponding vanation within groups Referring 
to Table 23 5, we find that for degrees of freedom Oj = I and = 20, an 
f-ratio greater than 8 10 would occur by chance only 1 time m 100 
experiments if the treatment actually had no effect upon com yields The 
probability of obtaining an F-ratio as great as 21 02 if the treatment 
were ineffective is negligibly small, we conclude therefore that nitrogen 



40} 


Cross-Classification and Analysis of Variance 

fertilizer applied at the rate of 40 pounds per acre has a highly sianificant 
effect upon com yields. 
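
The arithmetic of this test can be retraced from the summary figures alone.
The following is a minimal Python sketch, not part of the original text; the
function name `one_way_anova_from_summary` and its argument layout are
illustrative assumptions.

```python
def one_way_anova_from_summary(groups):
    """One-way analysis of variance from per-group summaries.

    groups: list of (n, mean, sum_of_squared_values), one tuple per treatment.
    (Hypothetical helper written for this illustration.)
    """
    n_total = sum(n for n, _, _ in groups)
    grand_sum = sum(n * mean for n, mean, _ in groups)
    raw_ss = sum(ssq for _, _, ssq in groups)

    # Sum of squares about the general mean of all observations.
    total_ss = raw_ss - grand_sum ** 2 / n_total
    # Sum of squares about the individual group means.
    within_ss = sum(ssq - n * mean ** 2 for n, mean, ssq in groups)
    # Remainder is due to departures of group means from the general mean.
    between_ss = total_ss - within_ss

    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_ratio = (between_ss / df_between) / (within_ss / df_within)
    return {
        "total_ss": total_ss,
        "between_ss": between_ss,
        "within_ss": within_ss,
        "df_between": df_between,
        "df_within": df_within,
        "f_ratio": f_ratio,
    }


# Summary figures from the tabulation above: (n, mean yield, sum of Y squared).
result = one_way_anova_from_summary([
    (18, 26.65, 14794.71),   # no fertilizer
    (10, 60.66, 43982.22),   # 40 pounds of nitrogen per acre
])
print(round(result["between_ss"], 2), round(result["within_ss"], 2),
      round(result["f_ratio"], 2))
```

The same helper accepts any number of treatments, so it can also be applied
to the three-treatment case discussed later in this section.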

The expected value of F would be 1.0 if the difference in treatments had
no systematic effect upon yields. For in that event the variance of the
group means about the general mean would reflect the same random
forces as does the variance of each subset of observations about its group
mean. The classification of observations on the basis of the amount of
nitrogen applied would then be irrelevant: it would not contribute
significantly to an explanation of the variance of all the observations about
the general mean.

The analysis-of-variance table presented above suggests certain obvious
analogies with regression analysis. The total sum of squares has been
partitioned into “explained” and “unexplained” components. If the two
treatments were two different varieties of corn we could say that 45 per
cent of the observed variation in yields was attributable to the difference
in varieties. In the current instance the treatments can be quantified
(as zero and 40 pounds of nitrogen respectively) and we can actually fit a
regression line to the 28 observations or draw a freehand line between
the two points (26.65, 0) and (60.66, 40) with a slope of 0.85025 bushels
per pound of nitrogen.

The student may gain a more complete understanding of the relation
between variance analysis and regression analysis by actually fitting the
regression line just mentioned. The basic data are given in Table 23.6.

The discontinuous values of X are quite typical in controlled
experiments. The required values are as follows:

$\Sigma Y = 1,086.3$          $M_y = 38.7964$
$\Sigma X = 400$              $M_x = 14.2857$
$\Sigma YX = 24,264$          $\Sigma X^2 = 16,000$

From these we find that

$$b = \frac{24,264 - 15,518.56}{16,000 - 5,714.28} = 0.85025$$

the same value obtained by simply drawing a line through the points
previously indicated. The constant a is given by

$$a = 38.7964 - 0.85025(14.2857) = 26.65$$

This is the mean of the 18 observations for plots which received no
fertilizer (X = 0). When X = 40, Y' = 26.65 + 40(0.85025) = 60.66,
the mean of the 10 observations for plots which received 40 pounds of
nitrogen.


Table 23.6

Corn Yields Related to Quantity of Nitrogen Fertilizer Applied

Corn Yield, Y        Nitrogen Applied, X    Corn Yield, Y        Nitrogen Applied, X
(bushels per acre)   (pounds per acre)      (bushels per acre)   (pounds per acre)

      24.5                   0                    32.4                   0
       6.2                   0                    27.4                   0
      26.7                   0                     5.3                   0
      29.6                   0                    17.9                   0
      22.1                   0                    23.9                  40
      30.6                   0                    11.8                  40
      44.2                   0                    60.2                  40
      21.9                   0                    82.5                  40
      12.0                   0                    96.2                  40
      34.0                   0                    80.7                  40
      37.7                   0                    81.1                  40
      34.2                   0                    51.0                  40
      38.0                   0                    79.5                  40
      35.0                   0                    39.7                  40

Source: Heady et al., op. cit.

Referring back to the analysis of variance table, it is clear that the
variation within treatments is identical with the sum of squared deviations
about the regression line and the variation due to the difference between
treatments is equal to that “explained” by the regression line. The sum
of squares about the general mean has 27 degrees of freedom. As the
regression line must pass through this general mean, only one additional
degree of freedom is used up in determining the slope of the regression
line, b. Hence the 26 degrees of freedom shown for “variation within
treatments” check with the 26 degrees of freedom left in the residuals
about a least-squares regression line fitted to 28 observations.
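
The identity just described can be checked directly on the 28 observations of
Table 23.6. The sketch below is an illustration added to the text, using
plain Python and variable names chosen here for convenience; it fits the
least-squares line and compares its residual sum of squares with the
variation within treatments.

```python
# Yields (bushels per acre) from Table 23.6, grouped by nitrogen applied.
no_nitrogen = [24.5, 6.2, 26.7, 29.6, 22.1, 30.6, 44.2, 21.9, 12.0,
               34.0, 37.7, 34.2, 38.0, 35.0, 32.4, 27.4, 5.3, 17.9]
forty_pounds = [23.9, 11.8, 60.2, 82.5, 96.2, 80.7, 81.1, 51.0, 79.5, 39.7]

xs = [0.0] * len(no_nitrogen) + [40.0] * len(forty_pounds)
ys = no_nitrogen + forty_pounds
n = len(ys)

mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Adjusted sums of squares and cross-products (deviations from the means).
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)

b = sxy / sxx                 # slope, bushels per pound of nitrogen
a = mean_y - b * mean_x       # constant term

total_ss = sum((y - mean_y) ** 2 for y in ys)
explained_ss = b * sxy        # variation "explained" by the line
residual_ss = total_ss - explained_ss


def ss_about_own_mean(values):
    """Sum of squared deviations of a group about its own mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


within_ss = ss_about_own_mean(no_nitrogen) + ss_about_own_mean(forty_pounds)

print(round(b, 5), round(a, 2))
print(round(explained_ss, 2), round(residual_ss, 2), round(within_ss, 2))
```

With only two distinct values of X the fitted line necessarily passes through
both group means, which is why the residual and within-treatment sums of
squares coincide.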

Applications of Variance Analysis: (2) Differences among the
Effects of Three or More “Treatments.” Suppose now that, in addition
to the 28 plots already noted, we have applied 80 pounds of nitrogen per
acre to 10 other randomly selected plots. We can again use variance
analysis to answer a question that is more appropriate where the treatments
are qualitatively different: Do variations in the amount of nitrogen applied
have a significant effect upon corn yields?

The appropriate analysis of variance in this case is as follows:

                                  Degrees of      Sum of        Mean
Source of Variation               Freedom         Squares       Square

Total                                 37         46,476.79
Differences among treatments           2         24,288.06     12,144.03
Variation within treatments           35         22,188.73        633.96

$$F = \frac{12,144.03}{633.96} = 19.16$$


Referring to Table 23.5, we note that for $n_1 = 2$ and $n_2 = 30$ degrees
of freedom an F-ratio greater than 5.39 would be expected to occur by
chance only 1 time in 100 experiments if in reality there were no
relationship between corn yields and amounts of nitrogen applied.

The analogy with regression analysis can also be extended to this
case. We could fit a second-degree parabola precisely to the three points
(26.65, 0), (60.66, 40), and (86.62, 80), 86.62 bushels being the mean yield
obtained on the 10 plots which received 80 pounds of nitrogen. The
“variation within treatments” would be identical with the sum of squared
deviations about this parabola, and the difference between this amount and
the sum of squared deviations about the general mean would be “accounted
for” by the parabolic relation between group-mean yields and quantity of
nitrogen applied. This analogy could obviously be extended to any
number k of groups receiving quantitatively different “treatments” of a
single factor if we are willing to conceive of a (k - 1)th-order parabola
passing through the k points of group means. (The significance of such a
parabola would be no greater than that of a series of straight lines
connecting each successive group mean to the one preceding it, as all the
yield observations would be concentrated at the k discrete levels of
fertilizer application.)
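
As an illustration of the parabola passing "precisely" through three group
means, the following Python sketch (added here, with a hypothetical helper
name `quadratic_through`) finds the coefficients by divided differences. Note
that the code takes points in (X, Y) order, nitrogen first, whereas the text
above lists them as (yield, nitrogen).

```python
def quadratic_through(points):
    """Coefficients (a, b, c) of Y = a + b*X + c*X**2 through three points.

    points: three (X, Y) pairs with distinct X values.
    (Illustrative helper, not taken from the original text.)
    """
    (x0, y0), (x1, y1), (x2, y2) = points
    # First-order divided differences between neighbouring points.
    d01 = (y1 - y0) / (x1 - x0)
    d12 = (y2 - y1) / (x2 - x1)
    # Second-order divided difference gives the coefficient of X**2.
    c = (d12 - d01) / (x2 - x0)
    b = d01 - c * (x0 + x1)
    a = y0 - b * x0 - c * x0 ** 2
    return a, b, c


# Group means: nitrogen applied (pounds) and mean yield (bushels).
group_means = [(0.0, 26.65), (40.0, 60.66), (80.0, 86.62)]
a, b, c = quadratic_through(group_means)

fitted = [a + b * x + c * x ** 2 for x, _ in group_means]
print(a, b, c)
print(fitted)
```

Because the curve reproduces each group mean exactly, the deviations of the
individual yields about it are the same as their deviations about the group
means.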

Applications of Variance Analysis: (3) Testing the Significance of
Additional Terms in a Simple Regression Equation. Suppose that we
have the data presented in the first and third columns of Table 23.7; we
assume for present purposes that the X and Y values represent only 9
individual observations. We have fitted successively a straight line and a
second-order parabola, and wish to determine whether the new term in
$X^2$ has significantly increased the proportion of variance attributable to
regression.

The least-squares regression line and parabola are respectively:

Straight line:  $Y' = 56.37 + 0.19792X$

Parabola:       $Y' = 34.54 + 0.66598X - 0.0014627X^2$



Table 23.7

Relation between Average Yields of Corn Obtained and Quantities of
Nitrogen Fertilizer Applied

Nitrogen            Number      Average Yield        Sum of Squared
Applied, X          of Plots    of Corn, Y           Yields, $\Sigma Y^2$*
(pounds per acre)               (bushels per acre)

        0              18            26.65              14,794.71
       40              10            60.66              43,982.22
       80              10            86.62              88,022.40
      120              10           103.59             123,820.19
      160              18           104.09             218,485.05
      200              10            97.66             117,912.18
      240              10           101.79             124,570.93
      280              10           106.03             133,183.85
      320              18           105.27             222,421.15

Source: Heady et al., op. cit.
* Based on the individual observations.


The following values are also obtained:

$\Sigma X = 1,440$              $M_x = 160$
$\Sigma Y = 792.36$             $M_y = 88.04$
$\Sigma X^2 = 326,400$          $M_{x^2} = 36,272$
$\Sigma Y^2 = 75,681.3122$
$\Sigma YX = 145,777.60$
$\Sigma YX^2 = 33,669,728$

These result in the following adjusted sums of squares and cross-products:

$\Sigma y^2 = 5,921.94$
$\Sigma yx = 19,000$
$\Sigma yx^2 = 4,929,248$

The variance in yields explained by the straight-line regression may be
calculated as $0.19792\,\Sigma yx = 3,760.48$, and that explained by the parabolic
regression curve may be computed from equation (12.2) as

$$0.66598(\Sigma yx) - 0.0014627(\Sigma yx^2) = 12,653.62 - 7,210.01 = 5,443.61$$




The difference between either of these values and $\Sigma y^2$ would be the sum
of squared deviations around the corresponding regression function. Hence
the difference between 5,443.61 and 3,760.48 is the additional variance
explained by the parabolic regression.

The analysis of variance to determine the significance of the new term
in $X^2$ in the parabolic equation is conducted as follows:

                                      Degrees of     Sum of       Mean
Source of Variation                   Freedom        Squares      Square

Total                                     8          5,921.94
Linear regression                         1          3,760.48     3,760.48
Additional variation accounted
  for by parabolic regression             1          1,683.13     1,683.13
Variation around parabolic
  regression                              6            478.33        79.72

$$F = \frac{1,683.13}{79.72} = 21.11$$


Referring to Table 23.5 for $n_1 = 1$ and $n_2 = 6$ degrees of freedom, we
find that the probability of a value of F greater than 13.74 occurring by
chance would be less than 0.01 (1 per cent) if the regression of yield upon
the additional term in the universe were really zero. It follows also that
the parabola gives a significantly better representation of the relationship
between Y and X than does the straight line. This process could be
extended to determine whether a third-order parabola gives a significantly
better fit than the second-order parabola already fitted.
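
The test for an added term reduces to a few subtractions and one ratio. The
sketch below is an added illustration in Python; the function name
`extra_term_f` and its parameters are assumptions made for this example, and
the inputs are simply the sums of squares printed above.

```python
def extra_term_f(total_ss, ss_reduced, ss_full, n_obs, k_full, k_added=1):
    """F-ratio for the terms added in going from a reduced to a fuller equation.

    total_ss   : sum of squares of Y about its mean
    ss_reduced : sum of squares explained by the simpler equation
    ss_full    : sum of squares explained by the fuller equation
    n_obs      : number of observations
    k_full     : regression coefficients (excluding the constant) in the
                 fuller equation
    k_added    : number of coefficients added
    (Hypothetical helper for illustration only.)
    """
    added_ss = ss_full - ss_reduced
    residual_ss = total_ss - ss_full
    residual_df = n_obs - 1 - k_full
    f_ratio = (added_ss / k_added) / (residual_ss / residual_df)
    return added_ss, residual_ss, residual_df, f_ratio


# Straight line versus second-order parabola fitted to the 9 values.
added_ss, residual_ss, residual_df, f_ratio = extra_term_f(
    total_ss=5921.94, ss_reduced=3760.48, ss_full=5443.61, n_obs=9, k_full=2)

print(round(added_ss, 2), round(residual_ss, 2), residual_df,
      round(f_ratio, 2))
```

The same function would serve for comparing a third-order parabola with the
second-order one by passing `k_full=3` and the corresponding explained sums
of squares.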

Applications of Variance Analysis: (4) Testing for Curvilinearity
of Regression. Let us now take account of the fact that each Y value
in Table 23.7 is an average of 10 or 18 observations. Assume that we have
fitted a straight line to the 9 mean values, weighting each of the 9
corresponding values of $Y$, $Y^2$, $X$, $X^2$, and $XY$ by the number of
observations (10 or 18) represented by the particular group mean. As a
result of the weighting pattern, this line differs slightly from that of the
preceding section, the new equation being

$$Y' = 52.11514 + 0.212208X$$

The variance due to regression is obtained by computing the squared
deviations of the 9 possible values of $Y'$ about the general mean and
weighting each of them by the number of individual cases (10 or 18)
located at that particular value of X.



The analysis of variance for the current example is as follows:

                                  Degrees of      Sum of        Mean
Source of Variation               Freedom         Squares       Square

Total                                113          242,707
Within groups                        105          149,366       1,422.53
Due to linear regression               1           61,676      61,676
Deviation of means about
  linear regression                    7           31,665       4,523.57


Referring to Table 23.5, we note that for $n_1 = 6$ and $n_2 = 100$, a value of
F greater than 2.99 would be obtained by chance only 1 time in 100
samples if the true regression in the universe were linear.

The logic of the above test may need some clarification. The variance
within groups is the “pooled” variance of individual yields about their
respective group means. This variance presumably results from random
factors independent of the quantity of nitrogen applied. If the group
means deviated from the regression line in consequence only of these same
random factors, the expected variance of the means about the regression
line would be equal to the variance of individual yields about the group
means.* In this case, then, the expected sum of squares due to deviations
of group means about the regression line would be $7(s_y)^2$, or
7(1,422.53) = 9,957.71. Their actual contribution was 31,665, or 3.18 times
the expected value.

As such a value would be highly improbable under our present
assumption, we conclude that other, non-random, factors are primarily
responsible for the deviations of group means about the regression line.
This is equivalent to saying that the expected mean values of Y for given
values of X in the universe follow some (unspecified) curvilinear pattern.
We might then proceed to fit an appropriate type of curve to the data.
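
The whole test for curvilinearity can be retraced from the group summaries
in Table 23.7. The following Python sketch is an added illustration, with
variable names chosen here. Because it starts from group means rounded to
two decimals, its sums of squares may differ slightly from the printed
table.

```python
# (nitrogen X, number of plots, mean yield, sum of squared yields), Table 23.7
groups = [
    (0, 18, 26.65, 14794.71),
    (40, 10, 60.66, 43982.22),
    (80, 10, 86.62, 88022.40),
    (120, 10, 103.59, 123820.19),
    (160, 18, 104.09, 218485.05),
    (200, 10, 97.66, 117912.18),
    (240, 10, 101.79, 124570.93),
    (280, 10, 106.03, 133183.85),
    (320, 18, 105.27, 222421.15),
]

n_total = sum(n for _, n, _, _ in groups)
mean_x = sum(n * x for x, n, _, _ in groups) / n_total
mean_y = sum(n * m for _, n, m, _ in groups) / n_total

# Weighted sums of squares and cross-products of the group means.
sxx = sum(n * (x - mean_x) ** 2 for x, n, _, _ in groups)
sxy = sum(n * (x - mean_x) * (m - mean_y) for x, n, m, _ in groups)

b = sxy / sxx
a = mean_y - b * mean_x

ss_regression = b * sxy
ss_between = sum(n * (m - mean_y) ** 2 for _, n, m, _ in groups)
ss_deviation = ss_between - ss_regression      # means about the line
ss_within = sum(ssq - n * m ** 2 for _, n, m, ssq in groups)

df_deviation = len(groups) - 2                 # 9 means less 2 constants
df_within = n_total - len(groups)

f_ratio = (ss_deviation / df_deviation) / (ss_within / df_within)
print(round(a, 3), round(b, 5))
print(round(ss_within), round(ss_regression), round(ss_deviation))
print(round(f_ratio, 2))
```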

* Note again that a variance is a sum of squares divided by the appropriate number
of degrees of freedom. While the group means would cluster more closely around
the regression line than would the individual observations, according to the relation
$s_M = s_y / \sqrt{n}$, the squared deviation of each group mean from the regression
line would be multiplied by n, the number of observations on which it was based, to
arrive at the contribution of their deviations to the total sum of squares. On the
average, then, each mean might be expected to contribute approximately $(s_y)^2$ to
this sum of squares. But two degrees of freedom are taken up in fitting the regression
line to the 9 group means, so the exact expected contribution of the group means
would be $7(s_y)^2$.

Applications of Variance Analysis: (5) Significance of Two or More
“Principles of Classification.” In certain types of experiments, we are
interested in determining whether each of two or more factors is
significantly associated with a given variable. In some cases the factors are
qualitative rather than quantitative; for example, we might have 3
varieties of corn and 3 different insecticides, and apply each insecticide to
each variety. A complete experiment might then involve 9 plots or some
multiple of 9. We wish to analyze the 9 observed yields to determine
whether either or both of the two “principles of classification” (differences
in varieties and differences in insecticides) are significantly associated with
differences in yields.

Some assumed data corresponding to such an experiment are shown in
Table 23.8.

The total variation is the sum of squared deviations of the 9 individual
yields about the general mean, $M_y = 85.078$. The variation due to
differences in insecticides is measured by the squared deviations of the 3
group means in the right-hand column about the general mean. (Each of
these squared deviations must be multiplied by 3, as each group mean
represents 3 individual observations.) The variation due to differences in
varieties is measured by the squared deviations of the 3 group means in
the bottom row about the general mean, multiplied by 3 as before. The
computation of these values will be left to the student.

The analysis of variance for this example is as follows:

                                  Degrees of      Sum of        Mean
Source of Variation               Freedom         Squares       Square

Total                                  8         11,012.38
Differences among insecticides         2          4,973.42      2,486.71
Differences among varieties            2          4,928.94      2,464.47
Residual, or “interaction”             4          1,110.02        277.50

For insecticides:  $F = \frac{2,486.71}{277.50} = 8.96$

For varieties:     $F = \frac{2,464.47}{277.50} = 8.88$




Referring to Table 23.5, we note that for $n_1 = 2$ and $n_2 = 4$ degrees of
freedom, values of F larger than 6.94 would occur by chance only 5 times
in 100 if the factor in question were not associated with corn yields. To
be significant at the 1-per-cent level (P = 0.01), F would have to be 18.00
or larger. The results of the experiment are therefore significant at the
5-per-cent level for each of the two factors.

The above example is analogous to multiple regression with two
independent variables. One could, of course, apply each of 3 different
fertilizers to each of the insecticide-and-variety combinations of our
example and apply each of 2 or more different methods of plowing to
each of the 27 combinations of 3 factors, and so on. These experiments
would be analogous to multiple regression with 3 or 4 independent
variables.


Table 23.8

Hypothetical Data on Corn Yields Resulting from Combinations
of 3 Varieties and 3 Insecticides

                            Varieties
Insecticides         1          2          3         Totals      Means

     1             40.60      55.88      59.07       155.55      51.849
     2             58.63     119.83     123.75       302.21     100.736
     3             56.72     128.15     123.07       307.94     102.646

Totals            155.95     303.86     305.89       765.70      85.078
Means             51.986    101.286    101.962       85.078



One peculiarity of the analysis of variance just summarized is the use of
the so-called “interaction term” in the denominator of the F-ratio.
Presumably random factors not related to insecticides or varieties are
responsible for a portion of the observed variation in yields. If the effects
of insecticides and of varieties upon yield are strictly independent of one
another, the interaction term gives us an estimate of the level of other
(random) effects. But it may be that there are joint effects of insecticides
and varieties in addition to the independent ones; if so, the true random
component would be smaller than the interaction term. Hence, use of the
interaction term in the above fashion gives a conservative estimate of the
F-ratio. If we replicated our experiment two or more times we could gain a
better estimate of the random component from the variance of individual
yields about the mean obtained for each of the 9 combinations of
insecticides and varieties.
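
The computation "left to the student" for Table 23.8 can be sketched as
follows. This Python fragment is an added illustration with variable names
chosen here; it works from the nine yields directly, so its sums of squares
may differ from the printed analysis in the first decimal place.

```python
# Rows are insecticides, columns are varieties (Table 23.8).
yields = [
    [40.60, 55.88, 59.07],
    [58.63, 119.83, 123.75],
    [56.72, 128.15, 123.07],
]

n_rows = len(yields)
n_cols = len(yields[0])
grand_mean = sum(sum(row) for row in yields) / (n_rows * n_cols)

row_means = [sum(row) / n_cols for row in yields]
col_means = [sum(yields[i][j] for i in range(n_rows)) / n_rows
             for j in range(n_cols)]

total_ss = sum((y - grand_mean) ** 2 for row in yields for y in row)
# Each squared deviation of a group mean is weighted by the number of
# observations that mean represents.
insecticide_ss = n_cols * sum((m - grand_mean) ** 2 for m in row_means)
variety_ss = n_rows * sum((m - grand_mean) ** 2 for m in col_means)
residual_ss = total_ss - insecticide_ss - variety_ss

residual_df = (n_rows - 1) * (n_cols - 1)
residual_ms = residual_ss / residual_df

f_insecticides = (insecticide_ss / (n_rows - 1)) / residual_ms
f_varieties = (variety_ss / (n_cols - 1)) / residual_ms

print(round(total_ss, 2), round(insecticide_ss, 2),
      round(variety_ss, 2), round(residual_ss, 2))
print(round(f_insecticides, 2), round(f_varieties, 2))
```

With replicated plots, the denominator would instead be taken from the
variation of individual yields about the mean of each combination, as the
paragraph above explains.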

Applications of Variance Analysis: (6) Relation to Multiple
Curvilinear Regression. The following experimental study by Heady, Pesek,
and Brown (referred to earlier in Chapter 21) is classically suited to our
present purpose. The complete set of 114 observations is included in
Table 23.9, the structure of which clearly suggests a multiple regression of
yields upon two different plant nutrients, which we will refer to as nitrogen
(N) and phosphate (P). As each combination of quantities of N and
P was applied to two different plots, the variance of these two yields about
their mean, with one degree of freedom for each of the 57 combinations,
provides an estimate of the random error component in yields, independent
of the effects (additive, joint, or both) of the fertilizer elements.


Table 23.9

Experimental Yields of Corn for Varying Levels of Fertilizer
Application on Calcareous Ida Silt-Loam Soil in Western Iowa in 1952
(yields are in bushels per acre)

P2O5                                  Nitrogen (pounds)
(pounds)      0      40      80     120     160     200     240     280     320

   0        24.5    23.9    28.7    25.1    17.3     7.3    16.2    26.8    25.1
             6.2    11.8     6.4    24.5     4.2    10.0     6.?     7.7    19.0

  40        26.7    60.2      -       -     96.0    95.4      -       -     81.9
            29.6    82.5      -       -    107.0    95.4      -       -     76.4

  80        22.1      -     99.5      -    115.9      -       ?       -    129.0
            30.6      -    115.4      -     72.6      -       ?       -     82.0

 120        44.2      -       -    119.4   113.6      -       -    114.9   124.6
            21.9      -       -     97.3   102.1      -       -    129.2    83.0

 160        12.0    96.2   102.2   133.3   129.7   105.7   1?0.?   123.6   135.6
            34.0    80.7   108.5   124.4   116.3   115.5   124.5   142.5   122.7

 200        37.7    81.1      -       -    128.7   140.3      -       -       ?
            34.2    51.0      -       -    109.3   142.2      -       -       ?

 240        38.0      -     97.2      -    127.6      -    121.1      -    130.9
            35.0      -    107.8      -    125.8      -    114.2      -    144.9

 280        32.4      -       -    129.5   134.4      -       -    130.0   124.?
            27.4      -       -    125.2   127.6      -       -    141.9   114.?

 320         5.3    79.5   116.9   135.7   122.9   138.7   127.3   131.8   127.9
            17.9    39.7    83.6   121.5   122.7   126.1   139.5   111.9   118.8

(- = combination not included in the experiment; ? = figure illegible)

The meaning of partial or net regression is also implicit in Table 23.9.
For each of 9 levels of phosphate application, a net regression line or
curve can be fitted to the observations on corn yield and quantity of
nitrogen applied. Similar net regression curves can be derived for the
other plant nutrient. That this is a regression rather than a correlation
model is evidenced by the selection of widely spaced values of the
independent factors and a heavy concentration of observations toward the
extreme ends of their ranges. A random sample from a normal trivariate
universe would give a large number of observations near the point of
means of the two independent factors and a very small number near the
extremes.

The authors proceeded to fit the following multiple curvilinear
regression joint function by least squares (as noted earlier in Chapter 21):

$$Y = -5.682 - 0.316N - 0.417P + 6.3512\sqrt{N} + 8.5155\sqrt{P} + 0.3410\sqrt{NP}$$

with standard errors of (0.040) and (0.040) for the coefficients of N and P,
and (0.8676), (0.8680), and (0.0385) for those of $\sqrt{N}$, $\sqrt{P}$, and
$\sqrt{NP}$ respectively.

The figures in parentheses are standard errors of the regression coefficients.
The t-ratios range from 7.32 to 10.44, all highly significant at the 1-per-cent
probability level.
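
As an added illustration (not part of the original study), the fitted
function and the ratios of its coefficients to their standard errors can be
evaluated with a few lines of Python. The names used are assumptions made
for this sketch, and the ratios are computed from the rounded figures
printed above, so the largest comes out a little under the 10.44 quoted.

```python
import math

# Coefficient and standard error for each term of the fitted function.
terms = {
    "N": (-0.316, 0.040),
    "P": (-0.417, 0.040),
    "sqrt(N)": (6.3512, 0.8676),
    "sqrt(P)": (8.5155, 0.8680),
    "sqrt(NP)": (0.3410, 0.0385),
}
CONSTANT = -5.682

# Absolute ratio of each coefficient to its standard error.
t_ratios = {name: abs(coef) / se for name, (coef, se) in terms.items()}


def predicted_yield(nitrogen, phosphate):
    """Yield (bushels per acre) estimated from pounds of N and P2O5 applied."""
    return (CONSTANT
            + terms["N"][0] * nitrogen
            + terms["P"][0] * phosphate
            + terms["sqrt(N)"][0] * math.sqrt(nitrogen)
            + terms["sqrt(P)"][0] * math.sqrt(phosphate)
            + terms["sqrt(NP)"][0] * math.sqrt(nitrogen * phosphate))


for name, t in t_ratios.items():
    print(name, round(t, 2))
print(round(predicted_yield(160, 160), 2))
```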

The analysis of variance associated with the complete experiment is as
follows:

                                  Degrees of      Sum of        Mean
Source of Variation               Freedom         Squares       Square

Total                                113          242,707
Treatments                            56          233,811        4,175
  Due to regression                    5          222,828       44,566
  Deviations from regression          51           10,983          215
Among plots treated alike             57            8,896          156


The regression is, of course, highly significant. The variance of deviations
of treatment means about the regression is small enough to be accounted
for by the same forces responsible for variation among plots treated alike.
Thus there is no point in searching for a better regression surface; while
another equally good might be found, it would be impossible to show that
it gave a significantly better fit to the data.

By way of contrast, the authors had fitted a different curvilinear joint
regression surface about which the variance of treatment means was roughly
four times as large; the variance among plots treated alike remained, of
course, at 156. The corresponding F-ratio, 4.01, would occur slightly less
often than 1 time in 100 experiments if the deviations of treatment means
from the regression surface were due exclusively to the same random factors
that caused yields to vary among plots treated alike. It followed logically
that one or more other regression surfaces might be found that would better
express the true (universe) relationship between corn yields and
applications of nitrogen and phosphate.
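
The comparison between deviations from regression and variation among plots
treated alike can be retraced from the analysis-of-variance table for the
complete experiment. The Python sketch below is an added illustration; the
dictionary layout is an assumption made for the example.

```python
# (degrees of freedom, sum of squares) from the complete-experiment table.
anova = {
    "total": (113, 242707),
    "treatments": (56, 233811),
    "regression": (5, 222828),
    "deviations": (51, 10983),
    "plots_treated_alike": (57, 8896),
}

mean_square = {name: ss / df for name, (df, ss) in anova.items()
               if name != "total"}

# Do treatment means scatter about the fitted surface more than
# replicate plots scatter about their own means?
f_deviations = mean_square["deviations"] / mean_square["plots_treated_alike"]
# Significance of the regression itself against the same error term.
f_regression = mean_square["regression"] / mean_square["plots_treated_alike"]

print(round(mean_square["deviations"]),
      round(mean_square["plots_treated_alike"]))
print(round(f_deviations, 2), round(f_regression, 1))
```

A ratio near 1 for the deviations, as here, is what one would expect if the
fitted surface leaves only random variation unexplained; the rejected
surface described above gave a ratio of 4.01 on the same denominator.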


Summary 

The relation of one variable to several others may be approximately
determined by detailed cross-classification of the observations according
to combinations of class intervals of all the independent variables. Very
large total numbers of observations are required to make the group
averages accurate, however, for the number of groups increases rapidly
with additional variables. Simple or multiple regression equations can be
fitted to the individual observations, or to the group averages, preferably
weighting each average by the number of observations on which it is based.
Correlation coefficients based on group averages are often much higher
than those based on the original observations, as the averages suppress
much of the random variation present in the latter; correlation coefficients
and standard errors of regression coefficients based on grouped data must
be interpreted with care.

Historically, variance analysis has been applied chiefly to discontinuous
or grouped data. When the principles of classification were quantitative,
the analysis has generally stopped (explicitly or implicitly) with
discontinuous “lines of averages” connecting the group means. All the
variation of group averages of the dependent variable about its general mean
is presumed to be explained by the independent variables. A continuous
regression function fitted to the same data would not pass precisely
through all the group averages (except in rare or trivial cases), so the
proportion of total variation “explained” by the regression function
would be somewhat smaller than that attributed in variance analysis to
differences among group means.

The historical distinction between the two approaches is not required
by the underlying statistical theory, and today statisticians dealing with
data from controlled experiments often fit regression functions to the
group means and appraise the goodness of fit of the functions using
significance tests customary in variance analysis. These tests lead to the
same estimates of probability levels as do the standard-error formulas
applicable to regression constants. Both approaches may be regarded as
based upon the same general theory of least-squares estimation. As in
regression analysis, attempts to measure the separate contributions of the
different independent factors to total variance often encounter the difficulty
of covariance, which is essentially intercorrelation among the independent
variables.

The methods of variance analysis may also be used to test the significance
of the improvement in fit shown by a curved regression over a linear one,
of additional constants in fitting a curve, or of an additional variable in a
multiple regression and correlation problem.

REFERENCE

Williams, E. J.: Regression Analysis, John Wiley and Sons, Inc., 1959.



CHAPTER 24 


Fitting systems of two 
or more simultaneous equations 


Introduction. In 1943, Haavelmo introduced a drastically different
method of statistical analysis for estimating relationships among economic
time series.^ Although this method was designed to handle problems in
the field of economics, similar problems may well exist in other disciplines
such as biology and physiology.

A complete description of the computations involved in handling fairly
large sets of simultaneous equations is beyond the scope of this book.
Friedman and Foote^ have published a handbook which is recommended
to those who are interested in applying this approach to any but the
simplest cases. Here we will simply examine the logic and mathematics on
which the method is based, as illustrated by a two-equation model.



and road surfaces might alter the numerical values of the coefficients, the
general form of the relationship would remain parabolic. On even firmer
grounds, the trajectory of an artillery shell would follow a parabolic “law”
except as modified by wind resistance or other additional influences.

These two interpretations have their counterparts in the analysis of
economic time series. Thus, we might find that a set of variables such as
the number of hogs on farms on January 1, the January level of steel
production, and the January index of wholesale prices (all commodities)
showed a high degree of correlation with the average retail price of pork
during January-March. This would be a purely empirical relationship,
though possibly a useful one. Economic theory does not provide us with
any simple interpretations of the net regression coefficients in this case,
although some rather long and roundabout chains involving correlations
with a series of intermediate variables might be hypothesized.

The other interpretation might be illustrated by a regression analysis
relating the per capita consumption of pork in a given year to the retail
price of pork, the retail price of beef, and the per capita disposable income
of consumers. The income series and the meat prices might be deflated by
an index of retail prices of all consumer goods and services; an index of
retail prices of all consumer goods and services other than pork and beef,
also deflated by the index of all “consumer prices,” might be included as
an additional variable. Economic theory tells us that these variables in
these forms logically belong in a consumer demand function for pork.
It tells us which of the net regression coefficients should be positive and
which negative, and places definite limitations upon the shapes of the net
regression curves. A relationship of this type is sometimes called a
“behavior equation”; i.e., consumers change their purchases of pork
because they experience changes in their incomes and in the retail prices of
pork, beef, and other things. These variables have immediate logical
connections with consumer decisions, whereas consumers have no direct
experience of changes in steel production, hog numbers, and wholesale
prices.

In the terminology of simultaneous-equations analysis, our second
regression represents an attempt to estimate “structural coefficients”; if
our estimating procedure is appropriate the equation as a whole will be a
“structural equation.” Conceptually, it is quite different from the
empirical estimating equation of our first example.

In brief, the object of the simultaneous-equations method is to determine
“structural equations”; its proponents argue that, in most situations
involving economic time series, “structural equations” cannot be estimated
satisfactorily by the least-squares, single-equation methods presented in
the first twenty-two chapters of this book.



Basic Concepts, Problems, and Definitions. We are all familiar with
the economic concept that the price of a commodity is determined by the
intersection of a supply curve and a demand curve. But price can also be
regarded in an “active” sense as bringing the quantity supplied and the
quantity demanded into balance (in fact, equality) with each other.

If we could perform some controlled experiments in which we fixed prices
at various levels and measured separately the quantities supplied (sold) by
producers and the quantities demanded (purchased) by consumers at
each price, the regression of quantities sold upon price would give us the
supply curve (with a positive slope, typically) and the regression of
quantities purchased upon price would give us the demand curve (with a
negative slope). Note that the quantity supplied by producers would not
necessarily equal the quantity demanded by consumers under the conditions
of this experiment; to actually hold the price at some predetermined level
we must stand ready to buy surpluses from producers and to avert possible
shortages by selling to consumers from stocks under our control.

In practice we may be unable to perform such experiments. Even if we
could, we would still be loath to forego the information contained in
time-series observations for earlier years, and these observations did not
arise from a controlled experiment.

Let us assume that we have only one measure of quantity for each
year (i.e., the quantity supplied by producers is identical with the quantity
purchased by consumers) and call the price-quantity observations in two
successive years $(P_1, Q_1)$ and $(P_2, Q_2)$. Suppose we note that [...]
increased [...] larger than $Q_1$, for producers would find it profitable [...]
would buy a smaller quantity at the [...] higher than [...]

To be perfectly concrete, suppose we find that $P_2$ is [...] higher than
$P_1$ and that $Q_2$ is 10 units higher than $Q_1$, so the ratio
$(P_2 - P_1)/(Q_2 - Q_1)$ [...]

[...] of 0.50. And each of these alternatives would be consistent with various
combinations of slopes and shifts of the (equally unknown) demand
function. (We assume P to be measured along the vertical, and Q along
the horizontal, axis of a diagram.)

In the universes to which least-squares regression equations apply, we
can reduce the standard errors of our estimates of universe parameters
indefinitely by increasing the number of observations in our sample.
But (in simple linear regression) any pair of observations will give us an
unbiased estimate of the regression slope. The present problem is different
in that each pair of observations gives us (in general) a biased estimate of
the universe parameters. We cannot improve our estimates of these
parameters by simply increasing the number of biased estimates implicit
in each pair of observations. It is different also in that we are trying to
estimate two universe regressions from a single set of values of P and Q:
one regression which typically has a positive slope and one which invariably
has a negative slope. These are not the two “elementary regressions” of
least-squares theory, for, from the arithmetic of regression analysis, the
coefficients

$$b_{qp} = \frac{\Sigma pq}{\Sigma p^2} \quad \text{and} \quad b_{pq} = \frac{\Sigma pq}{\Sigma q^2}$$

must both have the same sign or both be zero. Our problem cannot be
solved by choosing either P or Q as “the” dependent variable. The
economics of the situation implies that P (as an annual average) and
Q (as an annual total) have equal claim to this status if both producers
and consumers are free to adjust the quantities they sell or buy during the
course of the year.*

The first step in a simultaneous-equations analysis is to “specify the 


* If we thought of producers as offering a certain definite quantity $q_1$ for sale in
Week 1, consumers would pay a price $P_1$ corresponding to that quantity on their demand
function. Then, suppose that producers determine the amount they will offer in Week 2
on the basis of the “lagged” supply function

$$q_2 = a_2 + b_2 P_1$$

or, more generally,

$$q_t = a_2 + b_2 P_{t-1}$$

If the quantity so determined, $q_2$, differs from $q_1$, consumers will pay a price $P_2$ different
from $P_1$, so that, from Week 1 to Week 2, the price will have moved along the consumer
demand function according to the formula

$$(P_2 - P_1) = \frac{1}{b_1}(q_2 - q_1), \quad \text{or} \quad (q_2 - q_1) = b_1(P_2 - P_1)$$

If this weekly response mechanism operated throughout the year, then from Week 52
in Year 0 to Week 52 in Year 1 there would have been 52 supply responses, each




model” by which the observations were generated. This amounts to
defining the universe with respect to which a given set of observations has
sampling significance. In the present case the model will be as
follows:

$$\text{Demand curve:}\quad Q = a_1 + b_1 P + u \tag{24.1}$$

$$\text{Supply curve:}\quad Q = a_2 + b_2 P + v \tag{24.2}$$

In these equations, Q stands for quantity and P for price; u and v are
“disturbances,” random with respect to time and having an expected
value (universe mean) of zero. Both equations contain the same two
variables. With least-squares methods we can calculate the regression of
Q on P and the regression of P on Q, two different equations, but there
is absolutely no basis for identifying one of these as a supply curve and the
other as a demand curve.

The disturbance u in the present model cannot be treated as a random
error in either Q or P; rather, it causes random shifts in the level of the
demand curve. Similarly, v causes random shifts in the level of the supply
curve. As the curves shift about, their points of intersection constitute the
values of P and Q in the successive time periods. (The values of P and Q
are assumed to be measured without error.) If the demand curve remained
fixed for several periods all the points of intersection would lie on the
demand curve, which could then be estimated from the P and Q values by
least squares. If the demand curve shifted while the supply curve remained
fixed, the points of intersection would lie on the supply curve; in this case
the least-squares relation between P and Q would give us the supply
curve. In general, both curves are likely to shift and the least-squares
regression of Q on P would represent neither a supply nor a demand
curve but some uninterpretable mixture of the two.

If an infinite number of observations were drawn from the universe
defined by equations (24.1) and (24.2), the expected (universe) value of
the least-squares regression of Q on P would be:

$$B = \frac{b_2\sigma_u^2 - (b_1 + b_2)\sigma_{uv} + b_1\sigma_v^2}{\sigma_u^2 - 2\sigma_{uv} + \sigma_v^2} \tag{24.3}$$




where $\sigma_{uv} = \rho_{uv}\sigma_u\sigma_v$. This can be demonstrated by writing equations
(24.1) and (24.2) with each variable stated in terms of deviations from
its mean:

$$q = b_1 p + u \tag{24.4}$$

$$q = b_2 p + v \tag{24.5}$$

Solving these equations simultaneously for p and q in terms of the
disturbances u and v, we obtain

$$p = \frac{v - u}{b_1 - b_2} \tag{24.6}$$

$$q = \frac{b_1 v - b_2 u}{b_1 - b_2} \tag{24.7}$$

The least-squares regression of q on p is

$$B = \frac{\sum pq}{\sum p^2} = \frac{\sum[(b_1 v - b_2 u)(v - u)]}{\sum (v - u)^2} \tag{24.8}$$


If we carry out the multiplications and summations indicated and divide by
the total number of observations, we obtain equation (24.3). If the demand
curve has not shifted, $\sigma_u = 0$ and $B = b_1$; if the supply curve has not
shifted, $\sigma_v = 0$ and $B = b_2$. But even if one of these special cases has
actually existed we may not know it; in general, if equations (24.1) and
(24.2) constitute the true and complete model, there is no way in which
$b_1$ and $b_2$ can be estimated from the data. In current terminology, neither
equation is “identifiable” and the system as a whole is “underidentified.”

The situation can be resolved, however, if each equation happens to
contain a different “predetermined variable.” A predetermined variable
is one whose values are known or causally determined before the current
time period or by factors outside of the immediate supply-demand model.
(The outside factors are also referred to as “exogenous variables.”)
Statistically, these new variables (predetermined, including exogenous)
are treated as given numbers, analogous to the independent variables of
least-squares regression analysis. In contrast, P and Q, called “endogenous
variables,” are analogous to the dependent variables of least-squares
regression; their values are dependent upon those of the predetermined
variables acting in conjunction with the random disturbances.
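
The behavior of the least-squares slope in the underidentified model can be checked numerically. The following is a minimal sketch, not part of the original text: the slopes $b_1 = -0.8$ and $b_2 = 0.5$, the standard deviations of the shifts, and the function names are all illustrative assumptions. It draws independent shifts u and v, forms p and q from equations (24.6) and (24.7), and compares the fitted slope of q on p with the universe value given by equation (24.3).

```python
import random


def expected_slope(b1, b2, var_u, var_v, cov_uv=0.0):
    """Universe value of the least-squares slope of Q on P, equation (24.3)."""
    num = b2 * var_u - (b1 + b2) * cov_uv + b1 * var_v
    den = var_u - 2.0 * cov_uv + var_v
    return num / den


def simulated_slope(b1, b2, sd_u, sd_v, n=50_000, seed=0):
    """Draw independent shifts u and v, solve for p and q, regress q on p."""
    rng = random.Random(seed)
    sum_pq = 0.0
    sum_pp = 0.0
    for _ in range(n):
        u = rng.gauss(0.0, sd_u)              # shift in the demand curve
        v = rng.gauss(0.0, sd_v)              # shift in the supply curve
        p = (v - u) / (b1 - b2)               # equation (24.6)
        q = (b1 * v - b2 * u) / (b1 - b2)     # equation (24.7)
        sum_pq += p * q
        sum_pp += p * p
    return sum_pq / sum_pp                    # equation (24.8)


# Illustrative (assumed) values: demand slope -0.8, supply slope 0.5.
b1, b2 = -0.8, 0.5
theory = expected_slope(b1, b2, var_u=1.0, var_v=4.0)
sample = simulated_slope(b1, b2, sd_u=1.0, sd_v=2.0)
print("universe value from (24.3):", round(theory, 3))
print("slope fitted to simulated data:", round(sample, 3))
```

With these assumed values the fitted slope lies between the two structural slopes and equals neither; setting one of the variances to zero in `expected_slope` reproduces the two special cases, $B = b_1$ and $B = b_2$.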
Suppose, then, that the true model is as follows:

$$\text{Demand function:}\quad Q = a_1 + b_1 P + c_1 Y + u \tag{24.9}$$

$$\text{Supply function:}\quad Q = a_2 + b_2 P + c_2 Z + v \tag{24.10}$$




where u and v again are random disturbances, caused by a large number of
more or less remote events in the economic system, none of which taken
separately is of much consequence or has measurable effects on our
particular demand-supply system. For concreteness, let us say that Y is
consumer income and Z is an index of factors, known or logically
determined prior to January 1, which affect the quantity of pork that will
be supplied during the ensuing year. We shall treat Y as an exogenous
variable, which influences the endogenous variables P and Q but is not
influenced by them. This is a plausible approximation, as consumers
spend only about 2 per cent of their incomes on pork and, conversely,
not more than 2 per cent of consumer income is derived from activities
closely associated with pork production. We might reasonably assume
that some 98 per cent of the variation in Y is determined by factors other
than the price and quantity of pork and that, if our results are biased by
the fact that Y is not 100 per cent exogenous, the bias will not exceed 2
per cent.

The values of P and Q, then, are dependent on those taken by Y, Z,
u, and v. We shall therefore solve equations (24.9) and (24.10) for P
and Q as functions of the predetermined variables and the disturbances.*
This solution leads to the following two equations:

$$P = A_1 + B_1 Y + C_1 Z + d_1 \tag{24.11}$$

$$Q = A_2 + B_2 Y + C_2 Z + d_2 \tag{24.12}$$

These equations are called the “reduced form” of the structural model,
or “reduced-form equations.” As indicated in Appendix 3, Note 6, the
coefficients are definite algebraic combinations of the coefficients of the
two structural equations; $d_1$ and $d_2$ are combinations of the disturbances
(u and v) and the structural coefficients $b_1$ and $b_2$.

Each of the reduced-form equations contains a single dependent variable;
the values of Y and Z are given numbers, like independent variables in
the “regression model” of least squares; and $d_1$ and $d_2$ have the same
statistical properties as the random residuals of least-squares regression,
being uncorrelated (by assumption) with Y and Z. Therefore each of the
reduced-form equations can be fitted separately by the familiar method of
least squares. Then, as we know precisely what combination of the
structural coefficients is included in each coefficient of the reduced-form
equations, we can derive the structural coefficients by a few arithmetic
operations on those of the reduced form.

Specifically, $b_1 = C_2/C_1$ and $b_2 = B_2/B_1$. Knowing $b_1$ and $b_2$, we can

* See Appendix 3, Note 6, for the details of this solution.



Table 24.1

Basic Data for “Just-Identified” Model of Supply and Demand for Pork, United States, 1922-1941
Original Arithmetic Values





calculate $c_1 = B_1[-(b_1 - b_2)]$ and $c_2 = C_1(b_1 - b_2)$. The values $a_1$ and $a_2$
may be obtained by solving the following simultaneous equations:

$$-b_2 a_2 + b_2 a_1 = A_1(b_1 - b_2)(-b_2)$$

$$b_1 a_2 - b_2 a_1 = A_2(b_1 - b_2)$$

Adding the two equations, we can solve for $a_2$ as:

$$a_2 = \frac{A_2(b_1 - b_2) + A_1(b_1 - b_2)(-b_2)}{b_1 - b_2}$$

for at this stage all items in the right-hand term are known numbers.
Then, dividing the first equation through by $-b_2$, we obtain

$$a_1 = a_2 - A_1(b_1 - b_2)$$

It is important to note that this method of estimation flows directly
from the specification of the model in equations (24.9) and (24.10).
Without this specification it is quite unlikely that a researcher interested
in demand curves and supply curves would have fitted either of the
reduced-form equations. However, if he had been interested in forecasting
P and Q for the coming year on the basis of information available as of
January 1 he might very well have fitted these equations, using the
variables Z and Y', Y' representing an advance forecast of Y. But he
would not have called either of the equations a demand curve or a supply
curve; they would simply have been prediction equations without
structural significance.
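
The whole procedure (fit each reduced-form equation by least squares, then convert to structural coefficients) can be tried on artificial data. The sketch below is illustrative only: the “true” coefficients, the sample size, and the helper names `ols2`, `structural_from_reduced`, and `simulate` are assumptions, not taken from the text. The constants are recovered from $a_2 = A_2 - b_2 A_1$ and $a_1 = A_2 - b_1 A_1$, which is what the pair of simultaneous equations for the constants reduces to.

```python
import random


def ols2(y, x1, x2):
    """Least-squares fit of y = k + m1*x1 + m2*x2; returns (k, m1, m2)."""
    n = len(y)
    my, mx1, mx2 = sum(y) / n, sum(x1) / n, sum(x2) / n
    s11 = sum((a - mx1) ** 2 for a in x1)
    s22 = sum((b - mx2) ** 2 for b in x2)
    s12 = sum((a - mx1) * (b - mx2) for a, b in zip(x1, x2))
    s1y = sum((a - mx1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - mx2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    m1 = (s1y * s22 - s2y * s12) / det     # normal equations solved directly
    m2 = (s2y * s11 - s1y * s12) / det
    return my - m1 * mx1 - m2 * mx2, m1, m2


def structural_from_reduced(A1, B1, C1, A2, B2, C2):
    """Convert reduced-form coefficients (24.11)-(24.12) to structural ones."""
    b1 = C2 / C1
    b2 = B2 / B1
    c1 = -B1 * (b1 - b2)
    c2 = C1 * (b1 - b2)
    a2 = A2 - b2 * A1
    a1 = A2 - b1 * A1
    return {"a1": a1, "b1": b1, "c1": c1, "a2": a2, "b2": b2, "c2": c2}


# Hypothetical "true" structure, chosen only for this demonstration.
TRUE = {"a1": 2.0, "b1": -0.8, "c1": 0.9, "a2": 0.5, "b2": 0.3, "c2": 0.7}


def simulate(n=5000, seed=0):
    """Generate P and Q as the intersection of the two shifting curves."""
    rng = random.Random(seed)
    t = TRUE
    P, Q, Y, Z = [], [], [], []
    for _ in range(n):
        y, z = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)   # predetermined
        u, v = rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.1)   # disturbances
        p = ((t["a2"] - t["a1"]) - t["c1"] * y + t["c2"] * z + (v - u)) / (t["b1"] - t["b2"])
        q = t["a1"] + t["b1"] * p + t["c1"] * y + u       # demand function (24.9)
        P.append(p); Q.append(q); Y.append(y); Z.append(z)
    return P, Q, Y, Z


P, Q, Y, Z = simulate()
A1, B1, C1 = ols2(P, Y, Z)      # reduced form for P
A2, B2, C2 = ols2(Q, Y, Z)      # reduced form for Q
estimate = structural_from_reduced(A1, B1, C1, A2, B2, C2)
for name in TRUE:
    print(name, "true:", TRUE[name], "estimated:", round(estimate[name], 3))
```

Because Y appears only in the demand function and Z only in the supply function, each ratio of reduced-form coefficients isolates exactly one structural slope; that is what makes this artificial model just identified.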

Illustration of a “Just-Identified” Model. The model embodied in
equations (24.9) and (24.10) is said to be “just identified” because it
permits us to obtain a single unique estimate for each of the structural
coefficients. This contrasts with the underidentified case previously
mentioned, in which some of the structural coefficients cannot be estimated
at all, and with the “overidentified” case, in which some of the coefficients
can be estimated in two or more ways yielding different values.

As no new statistical calculations are involved in the just-identified
case, we shall present only the final equations obtained in an attempt to
derive simultaneous demand and supply equations for pork. The same
symbols are used as in equations (24.9) and (24.10); the basic data were
first differences of logarithms of annual observations for the period
1922-1941 (see Table 24.1).

The reduced-form equations are as follows:

$$P = -0.0108 + \underset{(0.1339)}{1.0813}\,Y - \underset{(0.1159)}{0.8320}\,Z, \qquad R^2 = 0.893$$

$$Q = 0.0026 - \underset{(0.0673)}{0.0018}\,Y + \underset{(0.0582)}{0.6839}\,Z, \qquad R^2 = 0.898$$




The numbers in parentheses are standard errors of the net regression
coefficients. The coefficients of the structural equations are calculated
as follows:

$$b_2 = \frac{-0.0018}{1.0813} = -0.0017$$

$$b_1 = \frac{0.6839}{-0.8320} = -0.8220$$

$$c_1 = 1.0813\{-[-0.8220 - (-0.0017)]\} = 1.0813(0.8203) = 0.8870$$

$$c_2 = -0.8320(-0.8203) = 0.6825$$

To compute the constant terms $a_1$ and $a_2$ we solve the following two
simultaneous equations:

$$0.0017a_2 - 0.0017a_1 = (-0.0108)(-0.8203)(0.0017) = 0.0000$$

$$-0.8220a_2 - (-0.0017)a_1 = (0.0026)(-0.8203) = -0.0021$$

Adding the two equations, we obtain

$$-0.8203a_2 = -0.0021 + (0.0017)(0.0089) = -0.0021$$

$$a_2 = \frac{-0.0021}{-0.8203} = 0.0026$$

Then, substituting $a_2 = 0.0026$ into the first equation divided by 0.0017,
we have $0.0026 - a_1 = 0.0089$, or $a_1 = -0.0063$. This completes the
set of coefficients for the two structural equations. Consequently, the
structural equations can be written as follows:

$$\text{Demand function:}\quad Q = -0.0063 - 0.8220P + 0.8870Y + u$$

$$\text{Supply function:}\quad Q = 0.0026 - 0.0017P + 0.6825Z + v$$
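
The arithmetic for the pork model can be retraced in a few lines. This is a sketch for checking the figures rather than a reproduction of the authors' worksheet; the function name is hypothetical, the reduced-form constant for P is taken as -0.0108 (the value used in the computation of the constants), and the constants are obtained from the equivalent shortcuts $a_2 = A_2 - b_2 A_1$ and $a_1 = A_2 - b_1 A_1$.

```python
def structural_from_reduced(A1, B1, C1, A2, B2, C2):
    """Structural coefficients of (24.9)-(24.10) from reduced-form coefficients."""
    b1 = C2 / C1                  # ratio of the coefficients of Z
    b2 = B2 / B1                  # ratio of the coefficients of Y
    c1 = -B1 * (b1 - b2)
    c2 = C1 * (b1 - b2)
    a2 = A2 - b2 * A1             # equivalent to adding the two constant equations
    a1 = A2 - b1 * A1             # equivalent to a1 = a2 - A1*(b1 - b2)
    return {"a1": a1, "b1": b1, "c1": c1, "a2": a2, "b2": b2, "c2": c2}


# Reduced-form coefficients for pork as given in the text.
pork = structural_from_reduced(
    A1=-0.0108, B1=1.0813, C1=-0.8320,   # equation for P
    A2=0.0026, B2=-0.0018, C2=0.6839,    # equation for Q
)
for name, value in pork.items():
    print(name, round(value, 4))
```

Rounded to four decimals, the results agree with the structural coefficients shown above; any small differences in the last place come from carrying unrounded intermediate values.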

Because the data were in logarithmic form, the coefficients represent
approximately the percentage changes in Q associated with changes of 1
per cent in each of the other variables. In particular, the coefficients of
P are estimates of the elasticities of demand and (simultaneous or
concurrent) supply for pork. Although standard errors of the structural
coefficients are not presented here, such errors (appropriate for large
samples) can be computed.

For each observation, values of u and v are computed from the following
forms of the last two equations:

$$u_t = Q_t + 0.0063 + 0.8220P_t - 0.8870Y_t$$

$$v_t = Q_t - 0.0026 + 0.0017P_t - 0.6825Z_t$$




The subscript t simply refers to the particular year for which the
disturbances are being calculated. The disturbances for each year, and their
successive differences, are shown in Table 24.2. The von Neumann
ratios prove to be as follows:

Demand function:

$$\left(\frac{0.008017}{0.003553}\right)\left(\frac{19}{18}\right) = 2.256403(1.055556) = 2.381760$$

Supply function:

$$2.238919(1.055556) = 2.363304$$

Referring to Table 20.5 for N = 19 observations, we find no evidence of
significant autocorrelation in either set of disturbances.


Table 24.2

Disturbances Computed from Structural Equations

          Disturbances in Demand Function    Disturbances in Supply Function
Year          u_t        u_t - u_(t-1)           v_t        v_t - v_(t-1)

1923        -0.013           ...                0.009           ...
1924         0.010          0.023               0.013          0.004
1925         0.020          0.010              -0.005         -0.018
1926         0.002         -0.018              -0.008         -0.003
1927         0.011          0.009               0.001          0.009
1928         0.002         -0.009               0.009          0.008
1929        -0.010         -0.012               0.000         -0.009
1930         0.022          0.032              -0.014         -0.014
1931         0.003         -0.019               0.013          0.027
1932        -0.022         -0.025              -0.017         -0.030
1933        -0.014          0.008              -0.004          0.013
1934         0.024          0.038              -0.031         -0.027
1935        -0.016         -0.040               0.005          0.036
1936         0.010          0.026              -0.010         -0.015
1937        -0.003         -0.013              -0.003          0.007
1938         0.014          0.017               0.019          0.022
1939        -0.007         -0.021               0.000         -0.019
1940        -0.014         -0.007               0.025          0.025
1941        -0.010          0.004              -0.003         -0.028
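
The von Neumann ratios can be recomputed from the yearly disturbances in Table 24.2. The sketch below assumes, as the figures 0.008017 and 0.003553 in the text imply, that the ratio is formed as the sum of squared successive differences divided by the sum of squared disturbances, multiplied by N/(N - 1); the function name is hypothetical and the series are the rounded three-decimal values for 1923-1941.

```python
# Disturbances for 1923-1941, in the order of Table 24.2.
u = [-0.013, 0.010, 0.020, 0.002, 0.011, 0.002, -0.010, 0.022, 0.003, -0.022,
     -0.014, 0.024, -0.016, 0.010, -0.003, 0.014, -0.007, -0.014, -0.010]
v = [0.009, 0.013, -0.005, -0.008, 0.001, 0.009, 0.000, -0.014, 0.013, -0.017,
     -0.004, -0.031, 0.005, -0.010, -0.003, 0.019, 0.000, 0.025, -0.003]


def von_neumann_ratio(series):
    """Mean-square successive difference over the mean square of the series.

    Squares are taken about zero rather than about the sample mean, an
    assumption that matches the sums of squares quoted in the text.
    """
    n = len(series)
    sum_sq_diff = sum((b - a) ** 2 for a, b in zip(series, series[1:]))
    sum_sq = sum(x ** 2 for x in series)
    return (sum_sq_diff / sum_sq) * n / (n - 1)


k_demand = von_neumann_ratio(u)
k_supply = von_neumann_ratio(v)
print("demand:", round(k_demand, 4))
print("supply:", round(k_supply, 4))
```

Both ratios come out a little above 2, the value expected for a series with no autocorrelation, which is consistent with the conclusion drawn from Table 20.5.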




Comparison of Structural Equations with Their Least-Squares
Counterparts. One can scarcely argue with the logic of the method of
reduced forms if two or more endogenous variables are in fact simultaneously
determined. But universal application of this approach to economic
time series has been opposed on both pragmatic and theoretical grounds.
The pragmatic argument is based upon comparisons of published structural
equations with least-squares relationships involving the same variables.
In most cases the least-squares coefficients have been very close to those
of the structural equations.

For example, the demand function for pork has been fitted by least
squares in two different ways. When Q is treated as the dependent
variable we obtain

$$\text{Least squares:}\quad Q = -0.0049 - \underset{(0.0594)}{0.7205}\,P + \underset{(0.0967)}{0.7646}\,Y, \qquad R^2 = 0.903$$

$$\text{Structural:}\quad Q = -0.0063 - 0.8220P + 0.8870Y + u$$

In each case the structural coefficient is within less than two standard
errors of the least-squares coefficient; the differences are not statistically
significant.

When P is treated as the dependent variable, we obtain

$$\text{Least squares:}\quad P = -0.0070 - \underset{(0.1032)}{1.2518}\,Q + \underset{(0.0861)}{1.0754}\,Y, \qquad R^2 = 0.956$$

For easier comparison, we divide all terms in the structural demand
equation by the coefficient of P (using the positive sign) and transpose P
and Q to opposite sides of the equality sign, obtaining

$$\text{Structural:}\quad P = -0.0077 - 1.2165Q + 1.0791Y + u'$$

The coefficients of the two equations are almost identical; they differ
by small fractions of one standard error. Evidently the least-squares
equation with price dependent gives an excellent approximation to the
structural demand function. And the logic of the simultaneous-equations
approach supports the choice of price as the dependent variable in our
least-squares demand function if there is no simultaneous response of
supply to price.*

* See Karl A. Fox, Structural analysis and the measurement of demand for farm
products, Review of Economics and Statistics, Vol. XXXVII, No. 1, pp. 57-66, February,
1954.




The corresponding comparison for the supply function (with Q dependent
in the least-squares equation) is as follows:

$$\text{Least squares:}\quad Q = 0.0022 - \underset{(0.0522)}{0.0788}\,P + \underset{(0.0724)}{0.6090}\,Z$$

$$\text{Structural:}\quad Q = 0.0026 - 0.0017P + 0.6825Z + v$$

The coefficient of P in the least-squares supply function is non-significant
according to the usual criteria and has a negative sign, whereas normally
we would expect a true supply function to show a positive response of
quantity to price.

If we discard the price variable from the least-squares supply function
as non-significant we obtain simply

$$\text{Least squares:}\quad Q = 0.0025 + \underset{(0.0857)}{0.6841}\,Z, \qquad r^2 = 0.898$$

This is almost identical with the structural equation, recognizing that
the coefficient of P in the latter is negligibly small and would prove to be
statistically non-significant if its standard error were calculated.

When a model of the United States economy during 1929-1952,
involving 15 simultaneous equations with a total of 51 structural
coefficients, was compared with its least-squares counterpart, the following
results were obtained:*

Ratio of Difference between
Coefficients to Standard Error     Between Constant    Between Net Regression
of the Structural Coefficient      Terms               Coefficients              Total

0-0.49                                    7                    14                  21
0.50-0.99                                 5                     6                  11
1.00-1.49                                 2                     8                  10
1.50-1.99                                 1                     2                   3
2.00-2.99                                 0                     4                   4
3.00 and over                             0                     2                   2

Total                                    15                    36                  51


Thirty-two of the least-squares coefficients were within one standard 
error, and 45 within two standard errors, of the corresponding structural 

* From Karl A. Fox, Econometric models of the United States, Journal of Political
Economy, Vol. LXIV, No. 2, p. 131, April, 1956. See also Table 3, pp. 135-136.

The original structural model will be found in L. R. Klein and A. S. Goldberger,
An Econometric Model of the United States, 1929-1952, North-Holland Publishing
Company, Amsterdam, 1955.




coefficients. (The standard errors of the differences between coefficients
obtained from two random samples of the same size drawn from the
same universe would average about 1.414 times as large as the standard
error of either coefficient taken separately; the factor 1.414 is equal to the
square root of 2.) If we apply this analogy it appears that no more than
6, and possibly only 2, of the differences would be significant at the 5 per
cent level.
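
The counts quoted in this paragraph follow directly from the tabulation of the 51 coefficients. The short sketch below (variable names are illustrative, and the reading of the two-sample analogy is an interpretation, not the authors' stated computation) tallies the classes and shows why the number of significant differences is bracketed between 2 and 6: if a difference must exceed about twice 1.414, or 2.83, standard errors, the class 2.00-2.99 straddles the threshold.

```python
# (class, constant terms, net regression coefficients), from the tabulation
bins = [
    ("0-0.49", 7, 14),
    ("0.50-0.99", 5, 6),
    ("1.00-1.49", 2, 8),
    ("1.50-1.99", 1, 2),
    ("2.00-2.99", 0, 4),
    ("3.00 and over", 0, 2),
]

totals = [constants + slopes for _, constants, slopes in bins]

within_one = sum(totals[:2])     # ratios below 1.00
within_two = sum(totals[:4])     # ratios below 2.00
beyond_two = sum(totals[4:])     # upper bound on significant differences
beyond_three = totals[5]         # lower bound on significant differences

# Under the two-sample analogy the standard error of a difference is
# sqrt(2) times that of a single coefficient, so "two standard errors of
# the difference" is about 2.83 standard errors of the coefficient.
threshold = 2 * 2 ** 0.5

print(within_one, within_two, beyond_two, beyond_three, round(threshold, 2))
```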

The similarities between the simultaneous-equations model and its
least-squares counterpart extend to the standard errors of the coefficients
(most of which were within 10 per cent of one another) and the extent
of autocorrelation in corresponding equations. Seven of the structural
equations showed significant positive autocorrelation; the least-squares
counterparts of six of these seven equations also showed significant
positive autocorrelation. An earlier model by Klein included 37
coefficients; 26 of these were within one standard error and 32 within two
standard errors of their least-squares counterparts.

In the simple two-equation model for pork the similarity of the
least-squares demand function to the corresponding structural equation has
both theoretical and statistical explanations. In the first place, 90 per
cent or more of the variation in pork production was attributable either
to variables which were predetermined as of January 1 or to exogenous
factors, such as the effects of weather and disease upon the number of
pigs saved per litter. Under 1922-1941 conditions, most hogs were
marketed at the age of 8 or 9 months; the gestation period for pigs is
about 4 months. With this built-in lag of 12 or 13 months between sow
breeding and hog slaughter (equals pork production), hog producers had
very little latitude to change the current year's production of pork in
response to the current year's price of pork or hogs. Hence, there was
little reason to expect a significant net regression coefficient between the
current price of pork and the current quantity of pork in the supply
equation. As a matter of fact, this structural coefficient was non-significant,
being based on a non-significant coefficient in the reduced form of the
model. Moreover, as already noted, the structural coefficient was negative,
whereas one would ordinarily expect an increase in price to induce an
increase in supply.

If we discard P from the supply equation as non-significant, we are left
with the least-squares equation

$$Q = 0.0025 + \underset{(0.0857)}{0.6841}\,Z$$

* Both the structural equations and the least-squares equations for this model are
presented in Lawrence Klein, Economic Fluctuations in the United States, 1921-1941,
pp. 108-114, New York, John Wiley and Sons, 1950.




as a basis for estimating current pork consumption from a composite of
predetermined factors affecting the production of pork. This new estimating
equation also helps to explain why the least-squares demand function
is so much like its structural counterpart. For if 90 per cent of the variation
in Q is associated with a predetermined variable its statistical properties
will be very similar to those of a predetermined variable. If any bias is
involved in treating Q as a predetermined variable in the demand equation
it would seem intuitively that the bias would not exceed 10 per cent.

On this argument, our least-squares demand equation fitted with price
as the dependent variable is roughly consistent with the theory underlying
the simultaneous-equations method. In fact, a structural equation which
includes only one endogenous variable, the others being predetermined
or exogenous, is called a “uniequational complete model.” In this model
a least-squares regression with the endogenous variable dependent gives
us the best possible estimates of the structural coefficients. A good many
least-squares demand functions for perishable foods can be rationalized
on this basis.

The least-squares method is also applicable to the situation in which
quantity sold (and consumed) in the current period is a function of price
during the preceding period. The model here is

$$\text{Demand curve:}\quad P_t = a_1 + b_1 Q_t + u \tag{24.13}$$

$$\text{Supply curve:}\quad Q_t = a_2 + b_2 P_{t-1} + v \tag{24.14}$$

If neither u nor v is significantly autocorrelated, the least-squares
regression of $P_t$ on $Q_t$ is an appropriate estimate of the demand curve, and the
least-squares regression of $Q_t$ on $P_{t-1}$ is an appropriate estimate of the
supply curve.
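
A small simulation illustrates why ordinary least squares is adequate for the lagged model of equations (24.13) and (24.14). The sketch is illustrative only: the coefficient values, the disturbance standard deviation, the starting price, and the function names are assumptions. Quantity in each period responds to the previous period's price, price then adjusts to that quantity, and the disturbances are drawn independently from period to period.

```python
import random


def fit_line(y, x):
    """Simple least-squares regression of y on x; returns (intercept, slope)."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope


def simulate(a1, b1, a2, b2, periods=20_000, sd=0.5, seed=1):
    """Generate prices and quantities from the recursive model."""
    rng = random.Random(seed)
    prices, quantities = [], []
    p_prev = a1                                   # arbitrary starting price
    for _ in range(periods):
        q = a2 + b2 * p_prev + rng.gauss(0.0, sd)   # supply, equation (24.14)
        p = a1 + b1 * q + rng.gauss(0.0, sd)        # demand, equation (24.13)
        prices.append(p)
        quantities.append(q)
        p_prev = p
    return prices, quantities


# Hypothetical coefficients, chosen so that the system is stable.
a1, b1, a2, b2 = 10.0, -1.0, 2.0, 0.5
prices, quantities = simulate(a1, b1, a2, b2)

a1_hat, b1_hat = fit_line(prices, quantities)            # P_t on Q_t
a2_hat, b2_hat = fit_line(quantities[1:], prices[:-1])   # Q_t on P_(t-1)
print("demand:", round(a1_hat, 2), round(b1_hat, 3))
print("supply:", round(a2_hat, 2), round(b2_hat, 3))
```

Each regressor here is determined before the disturbance of its own equation is drawn, so the two fits recover the assumed slopes closely. Introducing autocorrelation into the disturbances would break this independence, which is the condition stated in the text.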

Wold* and others have argued that most economic problems to which
simultaneous-equations methods have been applied could actually be
expressed in terms of equations such as (24.13) and (24.14) above. In
Wold's terminology, each equation would express a “unilateral causal
dependence” and could be fitted separately by least squares with the
logically dependent variable in the dependent position. The complete
set of equations would constitute a “recursive system” or “recursive


* See Herman Wold and Lars Juréen, Demand Analysis, New York, John Wiley and
Sons, 1953.




model.” To achieve this unilateral causal dependence, the time unit of
the model would be chosen to correspond to “the intervals of production
planning.” For many agricultural products the logical interval is a year;
in general, the planning interval might vary from industry to industry.
Wold's scheme appears reasonable when applied to individual commodities;
its practicality has not been tested in more extensive models
such as the Klein-Goldberger model of the United States economy
mentioned on page 425.

Simultaneous-Equations Methods for “Overidentified” Models. The
basic problem of overidentification can be illustrated in a two-equation
demand and supply model for beef. The structural equations are assumed
to be

$$\text{Demand:}\quad Q = a_1 + b_1 P + c_1 Y + d_1 W + u \tag{24.15}$$

$$\text{Supply:}\quad Q = a_2 + b_2 P + c_2 Z + v \tag{24.16}$$

where Z is an estimate of beef production based on wholly predetermined
variables and W is consumption of other meats, assumed predetermined
for present purposes. The reduced form of this model is




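$$P = \frac{(a_2 - a_1) - c_1 Y + c_2 Z - d_1 W + (v - u)}{b_1 - b_2} \tag{24.17}$$

$$Q = \frac{(b_1 a_2 - b_2 a_1) - b_2 c_1 Y + b_1 c_2 Z - b_2 d_1 W + (b_1 v - b_2 u)}{b_1 - b_2} \tag{24.18}$$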


We can estimate $b_1$ as a ratio of the coefficients of Z in the two equations.
But $b_2$ can be estimated in two ways, one as a ratio of the coefficients of
Y and the other as a ratio of the coefficients of W. If the two estimates
of $b_2$ differ, we can obtain different values of $a_1$, $a_2$, $c_1$, $c_2$, and $d_1$.

The reduced-form equations fitted to logarithms of the variables during
1922-1941 by least squares are as follows, neglecting the constant terms:

$$P = A_1 + \underset{(0.1061)}{0.8185}\,Y - \underset{(0.1877)}{0.8521}\,Z - \underset{(0.1631)}{0.4346}\,W + f_1(u, v), \qquad R^2 = 0.87$$

$$Q = A_2 + \underset{(0.0545)}{0.0509}\,Y + \underset{(0.0964)}{0.8801}\,Z - \underset{(0.0838)}{0.0899}\,W + f_2(u, v), \qquad R^2 = 0.87$$




Then $b_1 = 0.8801/(-0.8521) = -1.0329$; this is our structural estimate
of the elasticity of demand for beef. The two alternative values of $b_2$ are

$$b_{2(Y)} = \frac{0.0509}{0.8185} = 0.0622$$

and

$$b_{2(W)} = \frac{-0.0899}{-0.4346} = 0.2068$$

Methods are available for resolving this apparent conflict and obtaining
unique “best estimates” for each of the coefficients of an overidentified
model. However, the computations are quite extensive and will not be
demonstrated here.
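
The conflict between the two estimates is a matter of simple division, as the following sketch shows. The variable names are illustrative; the coefficients are the fitted reduced-form values for beef quoted above.

```python
# Fitted reduced-form coefficients for beef (logarithms, 1922-1941).
price_eq = {"Y": 0.8185, "Z": -0.8521, "W": -0.4346}      # equation for P
quantity_eq = {"Y": 0.0509, "Z": 0.8801, "W": -0.0899}    # equation for Q

# Z appears only in the supply function, so its ratio identifies the
# demand slope b1 in exactly one way.
b1 = quantity_eq["Z"] / price_eq["Z"]

# Y and W both appear only in the demand function, so each gives its own
# estimate of the supply slope b2; this is the overidentification.
b2_from_Y = quantity_eq["Y"] / price_eq["Y"]
b2_from_W = quantity_eq["W"] / price_eq["W"]

print("b1        :", round(b1, 4))
print("b2 from Y :", round(b2_from_Y, 4))
print("b2 from W :", round(b2_from_W, 4))
```

With one excluded variable in each equation the model would be just identified; the second variable excluded from the supply function supplies a second, generally different, route to the same coefficient.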

Some pragmatic comments are also warranted in the present case.
First, neither of the reduced-form coefficients in the numerators of the
ratios from which $b_2$ is calculated differs significantly from zero; one
is slightly smaller than its standard error and the other is only slightly
larger. Starting from such unpromising materials it seems likely that any
compromise estimate of $b_2$, which must be analogous to a weighted average
of $b_{2(Y)}$ and $b_{2(W)}$, will also be non-significant. If we discard Y and W
from the second reduced-form equation we obtain a least-squares estimate
of Q as a function of the composite predetermined variable

$$Q = A_2 + \underset{(0.0900)}{0.8878}\,Z + f(u, v), \qquad r^2 = 0.85$$

If we estimate $c_1$ and $d_1$ on the assumption that $b_2 = 0$, we obtain

$$c_1 = -(0.8185)(-1.0329) = 0.8454$$

and

$$d_1 = -(-0.4346)(-1.0329) = -0.4489$$

Our structural demand function would then be

$$Q = a_1' - 1.0329P + 0.8454Y - 0.4489W + u$$

Transposing P and Q and dividing through by the coefficient of P, the
structural equation becomes

$$P = a_1'' - 0.9682Q + 0.8185Y - 0.4346W + u'$$

The same variables fitted by least squares with P in the dependent position
give the following result:

$$P = a'' - \underset{(0.1179)}{1.0645}\,Q + \underset{(0.0620)}{0.8815}\,Y - \underset{(0.0887)}{0.5247}\,W, \qquad R^2 = 0.95$$

The respective coefficients differ by about one standard error in each case,
or about 0.7 standard error of their differences if the regressions had been




fitted to two random samples from the same universe. The similarities
between the two equations can be rationalized on the same basis as in
the case of pork.*

A full demonstration of computational methods for the overidentified
case is beyond the scope of this book. Friedman and Foote (op. cit.,
footnote 2) present complete computations for a lumber demand and
supply model that is formally identical with our model for beef. Twenty
pages (pages 30 to 49) are required to present the computations and describe
the successive steps involved; much more space would be required to
explain the logic of each step. The mathematical derivation of these
methods is also much more complex than that of the normal equations
for least-squares regression. Some of the more accessible presentations
are listed in the chapter references.

However, a recapitulation of the central ideas of the simultaneous-equations
approach may be helpful.

1. Variables are classified into two categories, (1) endogenous and
(2) exogenous and/or predetermined.

2. One or more of the relationships to be estimated contain at least
two endogenous variables. As these variables are jointly dependent,
it is illogical to treat some of them as independent variables in least-squares
regression equations.

Thus, if we denote endogenous variables by Y's and exogenous and/or
predetermined variables by Z's, the single-equation and the
simultaneous-equations models can be distinguished as follows:

$$\text{Single:}\quad Y_1 = f_1(Z_1, Z_2, \ldots, Z_k)$$

$$\text{Simultaneous:}\quad (Y_1, Y_2, \ldots, Y_n) = f(Z_1, Z_2, \ldots, Z_k)$$

In the simultaneous model we are dealing with the “regression” of a set
of dependent variables upon a set of independent variables.

* The structural equations for the beef model have been estimated by Joan Friedman
using the simultaneous-equations method theoretically appropriate in the overidentified
case. The equations, with standard errors of the respective coefficients, are as follows:

$$\text{Demand:}\quad P = a_1 - \underset{(0.24)}{0.91}\,Q + \underset{(0.13)}{0.89}\,Y - \underset{(0.18)}{0.47}\,W$$

$$\text{Supply:}\quad Q = a_2 + \underset{(0.06)}{0.09}\,P + \underset{(0.11)}{0.94}\,Z$$

As expected, the coefficient of P in the supply function was non-significant. All other
coefficients of the least-squares counterpart equations differed from corresponding
coefficients of the structural equations by much less than one standard error of the
latter coefficients.




3. The number of structural equations in a model must equal the
number of endogenous variables. If there is only one endogenous variable,
the model can be estimated by a single least-squares equation. Also, if any
individual equation in a simultaneous system contains only one endogenous
variable, this equation can be estimated by least squares.

4. The classification of variables must be based upon a logical analysis
of their nature with respect to the economic model in question. Variables
should not be reclassified for computational convenience or to meet the
mathematical requirements for “identification.” Conceivably, analysis
of a model from the standpoint of economic theory might indicate that
it is “underidentified” in the real world and that there is no way to estimate
it from the data. To proceed with any kind of statistical estimation in
this case would be mischievous and misleading.

5. It is assumed that all variables are measured without error. The
exogenous and/or predetermined variables are not correlated with the
random disturbances; the effects of these disturbances can appear,
therefore, only in the movements of the endogenous variables.

Note, however, that the disturbances are attached to equations rather
than to individual variables. This is sometimes emphasized by writing
structural equations in the form

$$Y_1 + b_2 Y_2 + \cdots + c_1 Z_1 + c_2 Z_2 + \cdots + a = u$$

Any one of the Y's could be written in the extreme left position and all
terms divided through by its coefficient, including the value of u for each
observation, as was shown on page 424 when P was transferred to the
traditional “dependent” position in place of Q. This operation does not
alter the time pattern of u (except by a scale factor), and is in sharp
contrast with the results of fitting least-squares equations with different
variables in the dependent position. The latter almost always changes the
time pattern of the residuals, which are regarded as attaching only to
the dependent variable.




interpreting individual regression equations Each regression is an applica- 
tion of statistical method to a particular subject matter problem The 
choice of variables and, in many cases, types of functions to be fitted flow 
from one’s knowledge of the subject matter The precision with which 
each variable is measured may have important implications for the 
analysis If a researcher works intensively with certain types of data for 
a number of years he may develop sound judgments concerning levels of 
measurement error, probable degrees of intercorrelation, and probable 
levels of unexplained variance m new regression analyses of such data 
A person of less experience and insight applying the same basic technique 
to the same data may seriously mislead himself and others as to the 
meaning of his results 

Many applications of the simultaneous-equations approach up to this
time have been marked by preoccupation with economic and statistical
theory and marred by lack of cumulative and intimate knowledge of the
data to which these were applied. If a “wrong” variable is included in
one equation it may seriously distort the coefficients of other equations;
a “right” variable subject to large measurement errors may have similar
consequences. It seems clear that the earlier proponents of this approach
overestimated the degree of simultaneity with which certain variables were
determined and were therefore unduly pessimistic concerning the
prevalence and magnitude of “least-squares bias” in applied research.

During the next decade or two, some researchers may build up the
combined knowledge of techniques and data that is needed for adequate
evaluation of the simultaneous-equations approach in various applied
fields. The wider availability of electronic computers will make
applications of the method feasible without consuming the researcher’s time
and energy in extensive computations. But difficulties in interpreting the
results will continue for many years. And if Wold and others prove correct
in their advocacy of recursive models, least-squares methods will remain
dominant in dealing with non-experimental as well as experimental
observations.

Summary 

In this chapter we have presented the basic logic of the simultaneous-
equations method of estimating economic relationships. This method
is logically appropriate for estimating “structural” or causal relationships
where the values of two or more variables are jointly dependent within
each time unit for which observations are available.

The single-equation least-squares method to which most of this book
is devoted is a valid application of the simultaneous-equations theory
whenever a “structural” equation happens to contain only one logically
dependent (“endogenous”) variable. In some other cases single-equation
methods can be made to provide close approximations to the desired
“structural” coefficients.

Though the simultaneous-equations method was designed primarily for the
analysis of time-series data in economics, there may well be problems in
the biological sciences to which it is applicable. Many more empirical
studies will be needed to establish clearly the practical value of this
method in various areas of research.


References

Friedman, Joan, and Richard J. Foote, Computational methods for handling systems
of simultaneous equations, U. S. Dept. Agr. Agriculture Handbook 94, 109 pp.,
illus., November, 1955.

Marschak, Jacob, Statistical Inference in Dynamic Economic Models, by Cowles
Commission Research Staff Members and Guests, edited by Tjalling C. Koopmans,
with introduction by Jacob Marschak, pp. 1-50, Monograph 10, John Wiley and
Sons, New York, 1950.

Hood, William C., and Tjalling C. Koopmans (eds.), Studies in Econometric Method,
Cowles Commission Monograph 14, John Wiley and Sons, New York, 1953.

Koopmans, Tjalling C., Statistical estimation of simultaneous economic relations,
Jour. Amer. Stat. Assoc., Vol. 40, pp. 445-466, December, 1945.

Hildreth, Clifford, and F. G. Jarrett, A Statistical Study of Livestock Production and
Marketing, John Wiley and Sons, New York, 1955.

Foote, Richard J., A comparison of single and simultaneous equation techniques,
Jour. of Farm Econ., Vol. 37, pp. 975-990, December, 1955.



SECTION VII


Uses and Philosophy 
of Correlation and 
Regression Analysis 


CHAPTER 25 

Types of problems to which 
correlation and regression
analysis have been applied 


This chapter examines various types of research problems to which
regression and correlation analysis have been applied, the types of logical
analysis made, some of the pitfalls and difficulties encountered, and
the kinds of conclusions reached.

When such a review was first presented in 1930, 77 individual studies
were listed. In 1941, even a brief sampling of additional studies brought
the number up to 95. In 1959, such studies have become so numerous
that the titles alone would fill a book this large. In several fields (crop
forecasting, supply and demand analysis, consumption studies, marketing
studies, hydrology, forest mensuration, for example) whole books have
been written about research methods in each field.

Brief History. The methods of correlation and regression analysis
were first developed by students of heredity, notably Karl Pearson. The
professional journal in this field, Biometrika, contains the original papers
establishing the method, and many studies using it in the field of heredity.
These include such studies as the relation of the stature of children to that
of their parents. The very term “regression” itself comes from this
initial use. When it was found that very tall or very short parents tended
to have children who were on the average less tall or short, this was
described as a tendency to “regress toward the mean,” and the line
describing this was called “the regression line.”

The early studies were mainly limited to simple correlation, and put
most emphasis upon the closeness of relation. Little use was made of the
method as a practical research tool before World War I, although the
theory had long been worked out, and outlined in standard statistical
texts, especially in Yule’s early editions (2).¹ Toward the third decade,
quantitative studies on the workings of the economic system were
stimulated by Wesley C. Mitchell’s pioneer work on Business Cycles (3), and
a few students began to use the tool for this purpose, notably Henry L.
Moore (4) and Henry A. Wallace (5). This was further encouraged in
agricultural economics by the followers of Henry C. Taylor and John D.
Black. Following the pioneer realistic price study on potatoes by
Holbrook Working (6) in 1922, and studies under Howard Tolley on
farm-management data, use of the method greatly expanded, with increased
emphasis on regression.

This was facilitated by the introduction, early in the 1920’s, of speedier
and better controlled methods of computing the constants, as described
earlier in this book, especially in Chapter 11. The use of the method spread
rapidly in a number of fields: farm management, farm prices, commodity
markets, crop estimating, forestry mensuration, general supply and demand
analysis, and finally (in the late 1930’s) to the inter-relations of the economic
system as a whole (macroeconomic aspects), as analyzed by Hansen,
Keynes, Tinbergen, and others.

A whole group of young research workers cooperated in the further
development of correlation methods during the 1920’s, notably Bradford B.
Smith, Andrew Court, Louis H. Bean, and Frederick V. Waugh. As the
use of these and related methods of more exact hypothesis and analysis
spread in economic fields, a name was coined for the quantitative
measurement of economic phenomena, and the international Econometric Society
came into existence, with its own journal, Econometrica. A new language
was soon developed in this field. (A few of these terms, such as “model,”
“exogenous” and “endogenous,” and “identification,” have been explained
in this text.) In other fields, notably in psychology and education, a
vigorous application and proliferation of the basic methods took place,
though rather as a parallel and independent development. In subsequent
years, the regression method has been applied widely in natural and
physical sciences, social sciences, and in commercial and industrial
uses.

The kinds of problems to which the method has been applied will now
be briefly reviewed, with somewhat fuller sampling of the earlier simpler
studies.

¹ Numbers in parentheses refer to references at the end of this chapter.

Applications in Agricultural Technology

Weather Conditions and Crop Yields. The influence of varying
weather conditions on crop yields is an obvious cause-and-effect
relationship with but little possibility of circular reasoning. Many efforts were
made to identify and measure the relationships, even before regression
and correlation were well understood. Some early studies, based on
experimental yields, when analyzed later by simple correlation, showed
r = 0.86 with a single weather variable (7). Analyses with more weather
factors are generally needed to explain variations in crop yields, as
illustrated by the corn-yield problem in Chapter 14, from Misner (8),
and the potato-yield problem of Chapter 22, from Waugh and associates
(9). Other studies include early ones by Moore on corn (10), cotton
yields (11), and spring wheat yields (12). Selection of variables was
generally based upon farmers’ experience, and upon that of workers in
research stations.

An exhaustive review by Sanderson (13) of such work in many different
regions and countries emphasized that many early studies used unduly
flexible regression curves and selected independent variables by retaining
only those out of the many examined which showed the highest
correlation with the dependent factor. This involved the danger that with
such samples variables might be retained whose high correlation was due
to chance fluctuations rather than true relationship. This resulted in
regression equations which often did not perform well when used for
subsequent forecasting. More recent carefully controlled studies have
given better results, as for example on wheat yields in Canada, which
gave high correlation (R of 0.81 to 0.94), and performed well when tested
in forecasting subsequent yields (14). Other recent weather-yield studies
include one in Peru (15) on potato yields.
Fisher, studying wheat yields at Rothamsted, pointed out that it
really made little difference to the growth of a crop whether a given rain
occurred on April 30 or May 1. Yet, if weekly periods were considered
for all different factors, the number of constants in the regression equation
might be excessive. He therefore applied a differential regression method
determining the rate of change in yield with rate of change in rainfall
throughout the growing period. With only a few constants, the resulting
smooth curve showed that the maximum effect of rainfall on yield was
in autumn and in spring. With rainfall distribution the only weather
variable considered, correlations ranged from 0.32 to 0.63 (16). Sanderson
developed a simplified method of making the calculations (17) and
modifications to introduce joint functional relations where needed. The
method has the disadvantage, however, that as yet it cannot distinguish
between the effects of too little and too much rain;
i.e., it cannot allow for curvilinear effects.
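
The economy in constants can be illustrated as follows. This is a simplified sketch of the general idea and not Fisher’s own computation: it assumes the effect of an inch of rain in week t follows a quadratic curve a(t) = c0 + c1 t + c2 t^2, so that yield depends on only three rainfall constants instead of one per week. The rainfall figures, the base yield, and the “true” curve are all invented, and the helper functions are written here for the purpose.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def least_squares(rows, y):
    """Least-squares constants from the normal equations."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


rng = random.Random(1)
WEEKS, YEARS = 12, 30
BASE_YIELD = 20.0               # hypothetical
TRUE_CURVE = [0.5, -0.2, -0.4]  # hypothetical c0, c1, c2

# Time within the season, rescaled to run from -1 to +1.
times = [2 * w / (WEEKS - 1) - 1 for w in range(WEEKS)]

# Invented weekly rainfall for each year, and yields built from the curve
# without any error term so that the recovery can be checked exactly.
rain = [[rng.uniform(0, 5) for _ in range(WEEKS)] for _ in range(YEARS)]
yields = [
    BASE_YIELD + sum((TRUE_CURVE[0] + TRUE_CURVE[1] * t + TRUE_CURVE[2] * t * t) * r
                     for t, r in zip(times, year))
    for year in rain
]

# Three composite rainfall variables replace twelve weekly ones:
# z_k = sum over weeks of t**k * rainfall, for k = 0, 1, 2.
rows = [[1.0] + [sum(t ** k * r for t, r in zip(times, year)) for k in range(3)]
        for year in rain]

beta = least_squares(rows, yields)  # [constant, c0, c1, c2]
print("fitted constants:", [round(v, 4) for v in beta])
```

With three composite variables the fitted constants trace a smooth curve of rainfall effect over the season. As the text notes, a curve of this kind is linear in the amount of rain at each date, so it cannot separate the effects of too little and too much rain.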

Physical Relations between Input and Output. Another type of
problem particularly important in agricultural research is to determine
the production function; i.e., the physical relation between the quantity
of input in production and the resulting output.

In one pioneer study the gain in weight of beef steers was related to the
quantities of each of several different feeds supplied, the length of feeding
period, and to initial weight (18). Curvilinear regressions were found,
and they showed a marked tendency towards diminishing returns.

Parallel analysis has been made of milk production as related to feed
inputs and other variables. In most of these studies (19-22) the feeds
used and milk produced have been considered on a herd-average basis.
In one study with data for individual cows, the results agreed quite well
with those based on herd averages (23).

Similar regression studies of the influence of physical input upon output
have been made for potatoes (24, 25), cotton (26), and other crops.

In one study of cotton, mixed fertilizer, nitrate of soda, calcium arsenate
applied, and the fertility of the land (as indicated by the yield of other
crops, notably corn) were considered as separate variables. The results
illustrate some of the logical problems which arise in regression analyses.
In the year studied there was a heavy weevil damage on untreated fields,
and the applications of arsenate increased the yield very materially. But
these results show nothing of the effect of poison on yield when weevil
infestation is slight. It would be necessary to repeat the study over several
years with varying weevil damage on untreated fields and relate the
differences in the effectiveness of poison to climatic and other factors, before
it would be possible to judge whether or not it would pay to use poison in
any particular year; and the prices both of poison and of cotton would
enter into the final consideration.

Modern examples of such analyses (173) applied to fertilizer
experimental results fitted by a joint function were shown in Chapter 21, with
functions fitted algebraically, and only statistically significant terms
included (Figures 21.7 and 21.8).

Other recent applications to agricultural production problems reflect
increased specialization in research. Dot charts and simple correlations
and regressions are widely used in interpreting experimental results, partial
and multiple correlations and net regressions less frequently. In crop
production, such analyses have been applied to such diverse problems
as the relation of artificially produced wheat hybrids to their parents (27),
or physical characteristics of different oat varieties to their yields (28),
and the effect of weather conditions on the staple length of cotton (29); in
livestock production, to studies of livestock physiology, such as reactions
of animals to varying temperatures (30), prenatal development rates in
lambs (31), factors affecting lamb weights, both hereditary and
environmental (32) (with four independent variables and two joint functions),
the effects of age and nutrition on hormones in cows (33), the physiology
of swine digestion (34); and in livestock feeding, to determine the optimum
levels of supplying different forms of sulfur in feeding lambs (35), and
factors affecting lamb weights at weaning (36).

Relation of Physical Characteristics of Samples to Chemical
Characteristics. Regression analysis has been widely used in determining
how chemical properties could be estimated from observable physical
properties. The example of protein content of wheat estimated from
vitreous kernels (Chapter 6) was taken from comprehensive studies (37)
in which weight, vitreous kernels, and the region of the country where
grown were all found significant. The volume of bread is related to the
gluten content of wheat and flour (38). The digestible composition of
meat can be judged from the proportion of visible fat (39), the tenderness
of the cooked meat from beef fat (40) or other characteristics of muscle
fibers (41), or digestible starch from the per cent of crude fiber (42).
Regression analysis may thus be used to generalize from tests under
carefully controlled conditions, and to develop more rapid methods of
estimating under everyday application, as in grading grain commercially
or estimating nutritive value in the home.
Physical Appearance and Productivity. Earlier, crops and livestock
were often selected for breeding on the basis of their outward size, shape,
etc. Correlation studies have investigated the assumptions on which
these practices were based. Studies of dairy cows by Gowen (43) indicated
that the physical conformation of dairy cows generally had little relation
to productive ability, and studies of corn showed that size and shape of
corn kernels, ears, and plants had little relation to actual yielding ability
(44, 45). These studies indicated that many of the time-honored points
stressed in agricultural show competitions and in breeding selection had
no utilitarian significance, and led to a new stress on performance records
rather than on physical appearance. Other work on animal conformation
showed that scores by qualified judges had only a limited relation to the
actual measurements, and that there was only limited agreement between
judges in different regions (46). Even the progeny testing of bulls under
experimental conditions has been found inadequate to predict results under
actual farm conditions, and “the repeatability of the special station results
in the field are very low, even in the high herds” (47).




Application to Physical Inter-Relations in Other Fields 

There are many other scientific and practical fields where regression
and correlation analysis help both in the testing of hypotheses and in
deriving regression equations by which difficult-to-measure or unknown
variables can be estimated closely from those which can be readily
measured or others already known.

Estimating Missing Water-Flow Records. With economic
development, the question frequently arises as to whether the flow of a given
river is sufficient to establish an irrigation system or hydroelectric
development, or how large a dam and storage basin is needed to prevent floods.
The best dam site is often at a point where few if any rainfall or water-flow
records have been kept. As soon as the possibility begins to be examined,
a stream-gauging station will usually be established, so that records for a
few years (say 5 to 10 years) become available. Hydrologists have
turned to regression analysis to extend these too-short records. Using
weather or water-flow data at other locations, and relating them to the
available data at the point desired, a regression equation is obtained to
extend the record with a known degree of accuracy. One simple example
was given in Chapter 5. Two general methods of approach are used (48):
to correlate the reported flow at the desired point with that at other
points on the same river, higher and lower if possible; or to correlate it
with the flow on other nearby streams or rivers. Differences of elevation
and configuration, as well as any local climatic features, also must be
considered. (For example, rainfall may be heavy on one side of a divide,
but light on the other, so that being on the same side would be more
important than actual distance.) Multiple regression studies indicate
which stations give the best estimate of stream flow at a desired point.
A second approach is to estimate the probable basin runoff from the
available records on rainfall, temperature, snow cover, topography, etc.
These are combined to estimate runoff by elaborate engineering
calculations, which themselves involve correlation studies at various points. By
proper design of the analyses, estimates based solely on calculated runoff
from weather records over the period when local stream-flow records
were not kept could be compared with corresponding estimated stream
flows based solely on flows at other points during those years, and an
independent check could thus be obtained on the confidence with which
those estimates can be used. Such a check would reduce the danger of the
estimates being less reliable than their standard errors indicated, for
reasons stated on pages 207 and 430.

The problem of estimating water flow has an extensive literature of its
own among hydrologists, engineers, and other workers in this field (48),
including combining records of stations for differing but overlapping
periods of time (49), and combining measures of rainfall, runoff, snow
cover, etc., with stream-flow records (50).
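
The first approach, extending a short record at the desired point from a longer record at another point, reduces in its simplest form to a two-variable regression. The following sketch assumes eight overlapping years of annual flow at a hypothetical dam site and at a neighboring gauging station, with three earlier years recorded at the neighbor only; all figures, including the years, are invented for illustration.

```python
from math import sqrt


def fit_line(x, y):
    """Least-squares line y = intercept + slope * x, with its standard error of estimate."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return intercept, slope, sqrt(sse / (n - 2))


# Hypothetical annual flows for the years when both stations were gauged.
neighbor_overlap = [410.0, 520.0, 380.0, 610.0, 450.0, 560.0, 330.0, 490.0]
target_overlap = [205.0, 268.0, 181.0, 310.0, 232.0, 279.0, 160.0, 251.0]

# Earlier years with records at the neighboring station only (hypothetical).
neighbor_earlier = {1931: 300.0, 1932: 580.0, 1933: 440.0}

intercept, slope, std_error = fit_line(neighbor_overlap, target_overlap)

# Extend the record at the desired point back over the ungauged years.
extended = {year: intercept + slope * flow
            for year, flow in neighbor_earlier.items()}

for year in sorted(extended):
    print(year, round(extended[year], 1))
print("standard error of estimate:", round(std_error, 1))
```

The standard error of estimate describes only the scatter within the overlapping years. An estimate for a year whose flow at the neighboring station lies outside the observed range, such as the lowest of the earlier years in this made-up example, is less reliable than that figure alone suggests.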

Applications in Industrial Engineering. Regression applications in
engineering concern relations which could not be measured readily in the
laboratory, such as the effect of local water characteristics on the amount
of deposits inside water or steam pipes. They could also be applied to
relating the weathering of different paints to the varying weather conditions
to which they were exposed. Other applications in engineering have been
suggested in recent literature (169). Studies are reported in the Netherlands
of the relation of temperature to gas consumption (170), and in Brazil
of an equation which related ultrasonic velocity in organic liquids to
their refractive index and molecular structure (171). From the published
work, applications in industry seem to be much less frequent than in
other fields. That may be because here, as with industrial price studies,
the work that is done is usually kept for private use within the business or
corporation.

Estimating Volumes of Irregular Objects. The example in Chapter 22
is an illustration of deriving practical approximations to measurements
that can be made exactly only with great difficulty and expense. Estimating
the volume of usable lumber in a standing tree is another important
application. Height of a tree and diameter or circumference near the base
can easily be measured, but the exact content of usable lumber can be
exactly determined only after the tree has been felled, trimmed, and the
trunk then measured with great care. This problem is of great practical
importance in practical logging, forest surveys, and forest experimental
plots, which depend on accurate estimates of size and growth without
cutting the trees. Multiple regressions have had significant success in
deriving formulas and alignment charts for various species that are easy
to use in practice and that have a high degree of reliability. Work in this
field contributed to the early development of multiple-regression methods
(51), and subsequent work has produced a sizeable special literature
(52-54), with special methods of its own (181).
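
One common form for such a formula, assumed here for illustration and not attributed to any of the studies cited, is linear in logarithms: log V = a + b log D + c log H, where D is diameter and H is height. The sketch below fits that form to invented measurements of felled trees whose volumes were generated from V = K D^2 H with a made-up constant K and no measurement error, and then estimates the volume of a standing tree.

```python
from math import exp, log


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def least_squares(rows, y):
    """Least-squares constants from the normal equations."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


K = 0.00004  # hypothetical constant (cubic meters, D in cm, H in m)

# Invented (diameter, height) measurements of felled trees.
felled = [(20.0, 15.0), (25.0, 18.0), (30.0, 17.0), (35.0, 22.0), (40.0, 21.0),
          (45.0, 26.0), (50.0, 24.0), (28.0, 25.0), (38.0, 16.0)]
volumes = [K * d * d * h for d, h in felled]  # generated without error

# Multiple regression of log volume on log diameter and log height.
rows = [[1.0, log(d), log(h)] for d, h in felled]
a, b, c = least_squares(rows, [log(v) for v in volumes])


def estimated_volume(diameter, height):
    """Volume of a standing tree from the fitted logarithmic formula."""
    return exp(a + b * log(diameter) + c * log(height))


standing = estimated_volume(33.0, 20.0)
print("fitted exponents:", round(b, 4), round(c, 4))
print("estimated volume of standing tree:", round(standing, 4))
```

Real felled-tree data would of course scatter about the fitted surface, and the exponents would differ by species; the error-free data are used only so that the fit can be verified against the generating formula.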

Applications in Astronomy. Astronomers calculated
the movements of the various heavenly bodies by fitting equations to the
observed positions, obtaining almost precise fits. The use of fitted
equations was routine in this field long before they began to be used in
less exact sciences through regression methods. In modern times, however,
with spectroscopic and other new information, with galaxies numbering in
the thousands and millions, and with the Einstein hypotheses and
atomic-energy discoveries, statistical methods began to be applied to such
problems as general laws of brightness, shift in color, apparent speed and
distance, population of galaxies in space, and the whole idea of the
ever-expanding universe. Studies in this field now use regression methods
extensively, but the literature is too vast to give many references. The latest
application is in computing the orbit of a “sputnik” from data reported by
many different observers.

Applications in Agricultural Economics 

Most of the problems considered to this point deal with relationships
in existing universes or ones likely to remain substantially unchanged over
the relevant time. But many problems in economics and other social
sciences deal with universes that are subject to human influence and change
at least gradually over time. Non-experimental data that exist only in the
form of unique observations for each successive interval of time are
particularly hard to appraise in terms of random sampling from a clearly
defined universe.

The conceptual advances made in recent years by Koopmans, Wold,
and others concerned with time series have been noted in Chapter 20.
The sample-universe relationship can be established reasonably well in
economic data, time series or other, if one is careful to define (and sample
from) a universe that does not change materially over a specified period.
For example, although trends in corn yields have no sampling significance
with respect to later trends, the universe of net relations between rainfall,
temperature and yield deviations from trend may very well remain constant,
or nearly so, for many years. Similarly, in a regression equation explaining
changes in the price of a farm commodity, one coefficient might remain
nearly constant for many years; another might change gradually over
time for fairly clear reasons; and a third might change quite radically
over a five-year period, again for clear or at least plausible reasons. The
reasons may be demonstrable only if a dense network of current (and
accurate) economic data exists for the commodity.

It is not surprising that some of the earlier regression analyses of
problems in agricultural economics, particularly those based on time
series, suffered from ambiguities in defining the universe with respect to
which inferences from a given set of observations were justified. On the
other hand, some of the analyses made in this field during the 1920’s
would meet all present-day standards of workmanship and statistical
reliability.

Farm Values as Related to Farm Characteristics and Other Factors.
Estimates of the sales value of farms are needed for levying taxes, securing
loans, or setting an offering price. Objective methods for making such
estimates are therefore very important. Such estimates involve (1) the
values of individual farms at any given time as affected by their
characteristics and their rights to share in public subsidies or payments, and (2)
changes in the general level of farm prices, with shifts in farm prices and
taxes, and public controls, subsidies, etc. There is a specific universe, for
the first, for each given period of time, and a large universe of farms which
can be sampled. The second, with changing prices, costs, and public
laws and regulations, involves time-series analysis. Since there are millions
of farms in a large country such as the United States, operating under
different physical, social, and even economic conditions, great geographic
variations are available for study, and analyses can be made simultaneously
in time and space.

The simpler problem was attacked first in a pioneer study by Haas (58).
Actual sales prices of a large sample of farms in a given region were
obtained, and also relevant facts such as distance from town, value of
buildings, proportion of crop land, fertility of the soil, and type of road
on which the farm fronted. Trends in land values were first eliminated,
and the adjusted acre prices were related to the other factors mentioned by
linear multiple correlation. Acre values estimated from the independent
factors had a standard error of $19 per acre, with R = 0.81. As assessed
valuations showed a much larger error, it was suggested that the impartial
regression equation might be substituted for the less reliable human
judgment in assessing individual farms for taxation purposes. A similar analysis,
made much later with the same objectives (18^), gave parallel results.

In another study a joint function was found for farm dwellings, with an
expensive dwelling adding more to the value of large farms than of small
ones. The net relation of value to type of road was determined by the
method of Chapter 22. Preliminary work on this study, with farm values
stated on a per-acre basis, gave a linear correlation of R = 0.98. This was
due to the presence of a few very small farms, which showed farm and
building values per acre both running into thousands of dollars. With
these farms excluded, the linear multiple correlation dropped to R = 0.64,
indicating the extent of the previous spurious correlation. Using
curvilinear relations and joint functions, the final analysis showed I = 0.77.
With 368 observations available, more complex methods could be employed
than would be feasible in most cases (59).
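
How a few extreme cases can manufacture a high correlation is easily shown with invented figures. In the sketch below, which does not use the data of the study just described, nine ordinary farms are arranged so that building value per acre and farm value per acre are exactly uncorrelated among them; two very small farms with per-acre values in the thousands are then added.

```python
from math import sqrt


def pearson(a, b):
    """Simple correlation coefficient between two equal-length series."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / sqrt(saa * sbb)


# Hypothetical per-acre values (dollars) for nine ordinary farms, laid out
# on a grid so that the two series are exactly uncorrelated.
building_per_acre = [10.0, 20.0, 30.0, 10.0, 20.0, 30.0, 10.0, 20.0, 30.0]
farm_per_acre = [50.0, 50.0, 50.0, 60.0, 60.0, 60.0, 70.0, 70.0, 70.0]

# Two very small farms whose per-acre values run into the thousands.
tiny_building = [1500.0, 2000.0]
tiny_farm = [2500.0, 3000.0]

r_ordinary = pearson(building_per_acre, farm_per_acre)
r_with_tiny = pearson(building_per_acre + tiny_building,
                      farm_per_acre + tiny_farm)

print("correlation, ordinary farms only:     ", round(r_ordinary, 3))
print("correlation, with two tiny farms added:", round(r_with_tiny, 3))
```

The two added farms lie so far from the rest that they dominate both sums of squares, and the coefficient approaches unity even though there is no relation whatever among the ordinary farms. Excluding such cases, as was done in the study, shows how much of the original correlation was spurious.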

The second question was dealt with in a study by Chambers over the
long period of rising land values prior to 1920. Farm-land values then
reflected not only changes in farm incomes, but also discounted the
expected continued rise in incomes. Further analysis covering the 1929
depression, World War II and the subsequent inflation period, and the
changing public interventions, would have to examine far more complex
issues.




Relation of Farm Organization to Farm Income. The combination
of enterprises which will produce the best returns to farmers in a given
locality has been much studied, using data from farm surveys, detailed
farm accounts, and other sources. Initially, the data were usually analyzed
by simple sorting, often based on the dependent variable, with differences
in farm income ascribed to concomitant differences in the average values
of the other variables. Application of multiple correlation to such data
gave different and more specific conclusions. Studies based on surveys in
Pennsylvania (59), Iowa (60), and Virginia (61) considered such variables
as farm size, crop acreage, size of important livestock enterprise (number
of cows or hogs), efficiency of crop production and livestock production,
and capital investment. About half the variation in earnings from farm
to farm was usually explained. The dominant influence was usually the
size and efficiency of the major enterprise, such as the corn and hog
enterprise on Iowa hog farms; tobacco acreage, yield and quality, on
Virginia tobacco farms; and the number of cows and efficiency of milk
production on Pennsylvania dairy farms. Many similar studies were made
in other areas.

The value of such statistical studies is, however, distinctly limited.
The results hold true only for the year of the data, due to fluctuations in
yields and prices. Each individual farm is a different entity, and the
combination of enterprises which produces the best results on the average
will not necessarily be the best for any one individual farm. If it were
possible to observe one farm under a hundred different types of
organization, but with the same price and yield conditions, and record the
resulting profit secured under each organization, it would then be possible
to determine what combination would yield maximum returns for that
farm under the stated conditions. This can, however, be estimated by the
farm-budget method of analysis (56, 62), which computes the probable
income under alternative combinations of enterprises on a specified farm,
and with varying intensities of operation and with selected combinations
of prices and costs. Such an analysis can provide a guide to the most
profitable farm organization for any desired combination of conditions
(57). This method has therefore largely replaced over-all regression
analysis based on large numbers of farms. The latter served to identify
the major factors involved (172) and still assists in the input-output
analysis (18) needed in farm-budget estimates. Since 1950 a more exact
method of analysis, linear programming, has been applied, most extensively
by Earl Heady (178, 182), in choosing input-output combinations that
will maximize net income.

Efficiency of Organization of Marketing Units. Regression analysis
has been applied in studying the efficiency with which individual market
facilities (elevators, flour mills, milk-receiving stations, etc.) are organized
and operated. Data from a sample of enterprises are analyzed with respect
to the relation of direct and overhead costs per unit, to size, capacity
utilized, organization of physical facilities, etc. Such studies provide
managers and cooperative or business concerns with indications of points
to watch for the most efficient results (63, 64).

Relation of Commodity Prices to Economic Conditions 

Most price studies involve time-series analysis, with only a single
observation available for any given period of time. However, there is
usually enough continuity in the way that individuals react in the aggregate
that fairly stable results can be secured from such series. Where the change
in reaction is continuous and progressive, that trend itself can be made one
variable in the analysis.

Price studies may be separated broadly into (1) those which follow the 
usual single-equation regression technique of taking one factor as 
dependent upon several other variables considered as given or independent, 
and then determining how the first may be estimated from the others, 
and (2) those that take two or more factors as simultaneously 
interdependent, and investigate the most probable nature of that 
interdependence by a system of equations simultaneously determined. Several 
other factors differing from equation to equation may, however, be held 
constant or taken into account at the same time (see Chapter 24). 

When single-equation studies are used, prices may be examined with 
any one of several factors regarded as dependent, notably (1) central 
market (wholesale) prices, (2) consumption, and (3) production. 

(1) Factors Affecting Central Market Prices. The earliest studies 
were those investigating the effect of production or total supply upon price 
at some representative market. These studies take as independent variables 
factors such as production or supply for the season, and current general 
business activity, which are either predetermined or independent of the 
market price, and then study their apparent net influence upon the price 
of the given product. This avoids the circular reasoning that may be 
involved if prices of one product are related to independent variables, 
such as prices of competing products, which may in turn be influenced by 
the price which is to be explained. 

Annual Prices. The simplest price studies are those which relate the 
market price for an agricultural commodity to the supply for a marketing 
year. Early work by Moore (65) indicated the general relation of price to 
supply for corn, hay, oats, potatoes, and cotton, with changing conditions 
eliminated by first differences. Subsequent work on potatoes (66-68), 




competing meats, and business activity, and trend and seasonal factors 
simultaneously determined. Monthly price studies have also been made 
for perishable and canned fruits (187-189). 

Weekly or Daily Prices. For very perishable products, price studies 
need to deal with the average prices for each week or even for every day. 
Early studies of this type are those of watermelons by Hedden and 
Cherniack (78) and of peaches by Kantor (79) for New York City. Both 
studies took into account variation in demand during the week. 
Temperature had a marked influence on watermelon prices. 

Similar studies for shorter or longer time intervals have since been 
published for other perishable products, including studies on cantaloupes 
by Rauchenstein (80), on Bartlett pears by Hoos and Shear (81), on 
Louisiana strawberries by Mehren and Erdman (83), and on canned and 
citrus fruits by Hoos (187-189). 

(2) The Effect of Price upon Consumption. This was examined in 
many early studies, especially of milk prices upon consumption (86). 
Later intensive studies have investigated demand curves in particular 
markets both by time-series analyses, and more generally by studies in 
space between regions or countries. Where ultimate consumption is 
being examined, retail price should logically be used to study consumption 
responses (72, 84, 85). In the United States differences in real income in 
time, or per capita in space, have been included as independent variables, 
and logarithmic transformations have been used to yield separate measures 
of the elasticity of consumption with respect to prices and real income (87). 
Extensive modern work in this field has been done by Stone in England 
(88) and by Fox and others in the U. S. Department of Agriculture. 
Wold and Juréen's rigorous study of demand provided a specially detailed 
examination of the hypotheses involved and of appropriate methods, in 
the light of these hypotheses (89). Independent measures of the income 
elasticity of demand derived from family budget studies agree moderately 
well with those derived from time series and interregional or intercountry 
studies, but in some cases the range has been wide (90). Work has also 
been done in Canada on factors affecting consumption of beef and pork 
(91), and meat generally. 
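
The logarithmic transformation mentioned above can be sketched as follows. When consumption, price, and real income are all taken in logarithms, the two net regression coefficients are themselves the price and income elasticities. The data here are hypothetical: they are generated from an assumed constant-elasticity demand with elasticities of -0.5 and 0.3, so the fit should simply return those two values. The function name and variable names are illustrative only.

```python
import math

# Hypothetical observations of retail price and real income per capita.
prices = [1.0, 1.2, 0.9, 1.5, 1.1, 1.3, 0.8]
incomes = [100.0, 110.0, 105.0, 120.0, 130.0, 125.0, 140.0]
# Consumption generated from an assumed demand Q = 50 * P**-0.5 * Y**0.3.
quantities = [50.0 * p ** -0.5 * y ** 0.3 for p, y in zip(prices, incomes)]


def fit_log_log(q, p, y):
    """Least-squares fit of log q = a + b*log p + c*log y; returns (b, c)."""
    lq, lp, ly = ([math.log(v) for v in series] for series in (q, p, y))
    n = len(lq)
    mq, mp, my = sum(lq) / n, sum(lp) / n, sum(ly) / n
    # Sums of squares and cross products of deviations from the means.
    spp = sum((a - mp) ** 2 for a in lp)
    syy = sum((b - my) ** 2 for b in ly)
    spy = sum((a - mp) * (b - my) for a, b in zip(lp, ly))
    spq = sum((a - mp) * (c - mq) for a, c in zip(lp, lq))
    syq = sum((b - my) * (c - mq) for b, c in zip(ly, lq))
    # Solve the two normal equations for the net regression coefficients.
    det = spp * syy - spy ** 2
    price_elasticity = (spq * syy - syq * spy) / det
    income_elasticity = (syq * spp - spq * spy) / det
    return price_elasticity, income_elasticity


price_elasticity, income_elasticity = fit_log_log(quantities, prices, incomes)
print(price_elasticity, income_elasticity)
```

With actual data the observations would scatter about the fitted surface, and the coefficients would carry standard errors that this sketch does not compute.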

Early studies for milk (92) showed that the other factors had a larger 
influence upon consumption than did prices. With cotton, price alone, 
and an upward trend in demand, almost completely determined world 
consumption (93), but United States consumption was also influenced by 
industrial activity. Later work has taken into account competitive 
synthetic fibers, as these become increasingly important. For potatoes, 
demand is very inelastic, and in years of low prices producers feed much 
larger quantities to livestock or allow them to go to waste. When the price 
falls very low, much of the supply is left in the ground undug. This 
"reservation demand" by producers involves a concurrent adjustment of 
supply to price, which might need analysis by the methods of Chapter 24 
(72, pp. 10-11). 

Studies of both cocoa and sugar consumption (95, 87) have been made 
in different countries with respect to the effect of both prices and average 
incomes on consumption during the same period of time. Similar studies 



Fig. 25.1. Per capita consumption of calories per day in various 
countries in relation to national income per capita. Crosses indicate 
consumption for human food; solid dots, consumption for food, livestock 
feed, and other purposes. (Food and Agriculture Organization 
of the United Nations, State of Food and Agriculture.) 



[Figure. Horizontal axis: total living expenditure per capita per year, in dollars.] 

Fig. 25.2. Annual food expenditures per capita in various countries 
for various levels of income per capita in each country. 

 1. India, 1954, Faridabad Township 
 2. Ceylon, 1952-1953, Total population 
 3. Ghana, 1955, Kumasi 
 4. Japan, 1954, Towns of over 50,000 inhabitants 
 5. Portugal, 1950-1951, Porto 
 6. Portugal, 1948-1949, Lisbon 
 7. Austria, 1952-1955, Towns of over 10,000 inhabitants 
 8. Ireland, 1951-1952, Total towns and villages, farmers usually excluded 
 9. Finland, 1950-1951, Urban married couples 
10. Panama, 1952-1953, Panama City 
11. Switzerland, 1936-1937, Workers and employees 
12. Sweden, 1952, Total population 
13. Sweden, 1948, Urban families with children 
14. Canada, 1948, Total non-agricultural population 
15. Canada, 1953, Five large towns 
\JnAcfiSla'tes Yaigecnw 

seasons (99). Subsequent experience showed, however, that continued 
high prices for two seasons had greater influence than for a single season 
(100). With hogs, it took eighteen months for price changes to be reflected 
substantially in market receipts (75). Corn prices were as important as 
hog prices in affecting hog production. Studied separately for different 
type-of-farming areas, there were marked differences in responses in 
different areas to the corn/hog price ratio, depending on the position of the 




relative importance of the different channels of influence (net regression 
coefficients). The diagrams typically implied that some elements could be 
estimated by single-equation methods, but that others should logically be 
estimated by the simultaneous-equations methods of Chapter 24 (see also 
193). Such studies include one on wheat (105) based on six structural 
equations which provide a complete hypothesis as to the direction and 
nature of the influence of all the variables in the system. Many were 
fitted by both the least-squares (single-equation) and the simultaneous- 
equations approaches, with very little difference in the results from the 
two methods. Other parallel studies have been made on the feed-livestock 
economy (106), corn and total feed concentrates (107), dairy products 
(191), food fats and oils (108), and coffee (183). A comparable German 
review used both single equations and simultaneous equations, and includes 
an interesting summary of the nature of all the cross-elasticities among 
commodity groups (109). Sugar was similarly studied in Austria (199). 
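
The contrast between the two approaches can be sketched on simulated data. The market below is entirely hypothetical: a demand equation with a true slope of -1.0, a supply equation that responds to current price, and a predetermined supply shifter z (weather, say). The single-equation estimate is the ordinary least-squares slope of quantity on price. As a simple stand-in for the fuller methods of Chapter 24, the second estimate uses z as an instrumental variable, which coincides with indirect least squares when the equation is just identified, as it is here.

```python
import random


def covariance(xs, ys):
    """Sample covariance of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)


def simulate_market(n, supply_price_response=0.5, seed=0):
    """Hypothetical market with a true demand slope of -1.0.

    demand: q = 10 - 1.0*p + u
    supply: q = 2 + supply_price_response*p + z + v
    z is a predetermined supply shifter; u and v are random disturbances.
    """
    rng = random.Random(seed)
    p_list, q_list, z_list = [], [], []
    for _ in range(n):
        u, v, z = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 2)
        # Equate demand and supply to obtain the market-clearing price.
        p = (8 + u - z - v) / (1.0 + supply_price_response)
        q = 10 - p + u
        p_list.append(p)
        q_list.append(q)
        z_list.append(z)
    return p_list, q_list, z_list


p, q, z = simulate_market(5000)
ols_slope = covariance(q, p) / covariance(p, p)  # single-equation least squares
iv_slope = covariance(q, z) / covariance(p, z)   # instrumental-variable estimate
print(round(ols_slope, 2), round(iv_slope, 2))
```

In this constructed case the least-squares slope is pulled toward zero because price is itself partly determined by the demand disturbance. The gap narrows as the predetermined shifter comes to account for most of the variation in price, which is one way the two methods can give nearly the same result.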

Factors Related to Price Margins. Other studies consider prices at 
different points in space or at different steps in the marketing process. 
Regression analysis has been used to measure the relative influence of 
changes in freight rates, location of supplies, and price level on the margin 
between potato prices in Minneapolis and New York (111). Since 
production affects wholesale prices most directly, farm prices affect production, 
and retail prices affect consumption, a complete explanation of the chain 
of price-supply-consumption events must also explain and analyze the links 
between the several market stages, farm-wholesale-retail. This phase of 
the problem seems in general to have been less intensively studied by 
means of regression analysis. 

Relation of Characteristics of Individual Lots of a Commodity 
to Prices. The price studies discussed treat changes in prices from time 
to time, for lots of the commodity of uniform or of average quality and 
usually at one selected stage of the marketing process. Except where 
different observations are made in space, only one unique observation 
can be drawn from each successive time period. In determining why 
different lots of the same commodity, sold within a given period and at the 
same stage of the marketing process, should sell for different prices, 
sampling theory is more directly applicable. There is a true universe, namely all 
the sales of the specified kind taking place within the specified period, and 
as large a sample as is desired can be secured, up to the limits of the 
universe. Studies of farm prices are one example of this type, and the 
relation of the price of apples to size, insect injury, and scab is another 
(Figure 21.10). 

One study related the prices of different lots of asparagus to the length 
of green color, stalks per bunch, and uniformity (112). The conclusions 



[Figure. "Factors Influencing Prices of Native Asparagus. Data gathered on 
200 boxes sold on Boston market, 1927. Green: 38½ cents extra per dozen 
bunches was received for each additional inch of green." Massachusetts 
Department of Agriculture.] 

Fig. 25.3. A pictorial presentation of conclusions reached by a multiple-correlation study. 
(From Frederick V. Waugh.) 





operate quite differently from the way they operate under freer atomistic 
competition (128). The situation is further complicated by the 
development of powerful labor unions in many industries which may at times raise 
their wages with little regard to increases in productivity. Great care is 
then necessary in setting up any statistical analyses in ways that represent 
the actual market situation. Regression has also been used in judging to 
what extent increases in productivity in various industries were reflected in 
corresponding wage increases or price decreases (129). 

Determination of Utility Rates. A series of studies for public and 
private streetcar, bus, and other utility concerns related the density of 
traffic to fares, frequency of service, levels of employment and income, 
and motor vehicles per capita. These analyses served to forecast with 
good accuracy the probable effect of proposed changes in rates or service 
on the volume of traffic and earnings. Sometimes they have 
shown that a proposed rate increase would so reduce traffic as to cause a 
loss, and that a reduction in fare would increase net income. Since these 
studies are made as a commercial service, they have not been published. 
Parallel studies have been made of the effect of rates and other factors on 
the demand for taxicabs in Amsterdam (131), and on the relation of the 
density of automobiles in different countries to income per capita and 
taxation per car (132) by Tinbergen and his associates. 
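
The arithmetic behind such a finding can be sketched with a constant-elasticity traffic response. The fare, ridership, and elasticity below are hypothetical, and the function covers gross fare revenue only; the studies described also took account of service, employment, income, and costs, none of which appear here.

```python
def revenue_after_fare_change(base_fare, base_riders, fare_elasticity, new_fare):
    """Fare revenue at new_fare, assuming riders respond with constant elasticity."""
    riders = base_riders * (new_fare / base_fare) ** fare_elasticity
    return new_fare * riders


BASE_FARE = 0.10          # assumed fare in dollars
BASE_RIDERS = 1_000_000   # assumed annual riders at the base fare
ELASTICITY = -1.4         # assumed elasticity of traffic with respect to fare

base_revenue = BASE_FARE * BASE_RIDERS
revenue_if_raised = revenue_after_fare_change(BASE_FARE, BASE_RIDERS, ELASTICITY, 0.11)
revenue_if_cut = revenue_after_fare_change(BASE_FARE, BASE_RIDERS, ELASTICITY, 0.09)
print(base_revenue, revenue_if_raised, revenue_if_cut)
```

Whenever the assumed elasticity is greater than one in absolute value, a fare increase lowers revenue and a fare cut raises it; with an elasticity between zero and minus one the directions reverse.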

Factors Affecting International Trade. Studies of factors affecting 
the volume and direction of international trade have measured elasticity 
of import demand with respect to price in various countries over various 
periods of time, and in some cases elasticity with respect to income. A 
recent study in this field (190) came to the conclusion that "the price 
mechanism works powerfully and pervasively in international trade," 
postwar as well as prewar. 

Size Standards for Children's Clothes. A study of appropriate size 
standards for children's clothes examined body dimensions for thousands 
of children, together with their age, sex, and race. Height and girth at 
hips were found most important in judging size as a whole, with age 
having no net effect. A new system of clothes sizes, based on the joint 
distribution of these two measurements, was recommended to clothing 
manufacturers (133). 

Sales Quotas for Local Districts. Corporations planning sales 
campaigns, advertising budgets, or location of branch offices need to 
estimate the sales potential of individual counties or other local units. 
Facts about the county, such as population, income, value of farm 
production, etc., are related to past sales by multiple regressions, and 
used to estimate future prospects. Naturally the weighting equations are 
very different if farm machinery, automobiles, or motion pictures are 




time a sharp industrial depression of the sort that happened twice after 
World War I, and that such mild recessions as did occur were soon 
corrected. Regression analysis has also been applied to explain 
technological change and economic development (195, 196). New problems 
have arisen (those of creeping inflation and of rising price levels around 
the world) despite farm surpluses and steadily rising output of goods and 
services in nearly all countries. Such problems, too, are being subjected to 
careful theoretical and quantitative analysis, which may help the nations 
concerned to find effective solutions to them. 


Correlation and Regression Methods in Political Science 
and Politics 

Statistical analysis has been extensively applied to political behavior. 
The Gallup Poll, the Roper Poll, and others have become almost household 
words. Along with these, regression analysis has been used by Bean and 
others to establish many political relationships, such as those between 
votes by states and by the nation, and to develop the predicting reliability 
of opinions or votes in particular areas (142, 143). Regression methods 
have also been used in detailed studies of political structure and behavior 
in particular cities or localities (144). 

Correlation Methods in Psychology and Education 

In educational and psychological investigations correlation and 
regression methods are applied to the study of such problems as the 
relations of grades in different subjects, scores on different mental tests 
(146), or the relation of scores on mental tests to success in the 
schoolroom (147) or in later life (148). 

Studies have also been made of the relation of mental and physical 
characteristics to success in different occupations (149), and the relation 
between civil service salaries and a battery of job characteristics (180), 
which provided a basis for rating other jobs by the regression equation. 

In psychological problems correlation analysis has been used primarily 
to measure closeness of relationship. One study reached the conclusion 
that even in groups of the same economic and social status, there is a 
small negative correlation between number of children per family and 
intelligence (150). In another study a given test was repeated with varying 
time to complete it, and it was concluded that the test determined power 
alone, rather than speed (151). Here basing the conclusion on r (0.76) 
instead of d (0.58) led to overstressing the significance of the observed 
correlation. Other applications of correlation or partial correlation in 




with a median about 0.30. The judges thus disagreed widely in their 
ratings of patients by this test (163). 


Tests of Correlation and Regression Results 

It has been possible to verify or revise some earlier studies by applying 
them to later data, or by comparing them with analyses for longer periods. 
Studies of the response of milk production to prices received, for example, 
gave results quite different from those of earlier studies, and led to the 
conclusion that factors important while the industry was expanding in a 
given region did not have the same significance after maturity was reached 
(164), i.e., the economic response was irreversible. More intensive 
analysis of the organization of typical farms provided conclusions as to 
the long-run response of production to price (165). In a different case, 
the response of milk production to variations in feed input on farms was 
later tested by a series of feeding experiments. The analysis of these 
results (166, 174) showed a net relation of milk output to feed input 
which agreed rather well with the earlier net regression (167), based on 
cow-test association records. 

Enough has been presented to illustrate the wide range of problems in 
which the use of regression analysis sheds new light on actual relationships. 
These illustrations indicate the necessity for careful logical analysis, and 
the need both for good theoretical knowledge of the field in which the 
problem lies and for thorough technological knowledge of the elements 
involved in the particular problem. 

Only a few of the significant statistical studies in any one field have been 
included. In many cases an important study has not been referred to 
because the point was already covered, or an unimportant study has been 
mentioned because of its pertinence to a particular topic. This discussion 
should therefore not be regarded as a critical evaluation of the work in any 
of the fields touched upon. Instead, the comments are intended solely to 
develop the variety, complexity, and significance of the problems to which 
regression and correlation analysis may be applied, and the care and 
thought which are even more necessary than the statistical computations if 
the results are to be of lasting value. 

REFERENCES 

1. Pearson, Karl, The law of ancestral heredity, Biometrika, Vol. II, pp. 211-236, 1903. 
    Pearson, Karl, and Alice Lee, On the laws of inheritance in man, I. Inheritance of 
    physical characters, Biometrika, Vol. II, pp. 357-462, 1903. 
2. Yule, G. Udny, An Introduction to the Theory of Statistics, 6th edition, Chapter 
    XII, Charles Griffin, London, pp. 229-253, 1922. 




26. Westbrook, E. C., and others, An economic study of farm organization in Sumter 
    County, Ga., State College of Agr. Bul. 324, pp. 82-87, December, 1927. 
27. Bell, G. D. H., Mary Lupton, and Ralph Riley, Investigations in the Triticinae, III. 
    The morphology and field behaviour of the A2 generation of interspecific and 
    intergeneric amphidiploids, Jour. Agr. Science, Vol. 46, part 2, pp. 199-231, 
    August, 1955. 
28. Grafius, J. E., The relationship of stand to panicles per plant and per unit area in 
    oats, Agron. Jour., Vol. 48, pp. 460-62, October, 1956. 
29. Hanson, R. G., E. C. Ewing, and E. C. Ewing, Jr., Effect of environmental factors 
    on fiber properties and yields of Deltapine cottons, Agron. Jour., Vol. 48, 
    pp. 573-581, December, 1956. 
30. Casady, R. B., J. E. Legates, and R. M. Myers, Correlations between ambient 
    temperatures varying from 60°-95° F. and certain physiological responses 
    in young dairy bulls, Jour. Agr. Science, Vol. 15, pp. 141-152, February, 1956. 
31. Joubert, D. M., A study of the pre-natal growth and development in the sheep, 
    Jour. Agr. Science, Vol. 47, pp. 382-428, August, 1956. 
32. Barnicoat, C. R., and others, Milk secretion studies with New Zealand Romney 
    lambs, Jour. Agr. Science, Vol. 48, pp. 9-34, October, 1956. 
33. Armstrong, David T., and William Hansel, The effect of age and plane of nutrition 
    in growth hormone and thyrotropic hormone control of pituitary glands of 
    Holstein heifers, Jour. Animal Science, Vol. 15, pp. 640-649, August, 1956. 
34. Castle, Elizabeth J., and M. E. Castle, The rate of passage of food through the 
    alimentary tract of pigs, Jour. Agr. Science, Vol. 47, pp. 196-204, April, 
    1956. 
35. Albert, W. W., and others, The sulphur requirement of growing fattening lambs in 
    terms of methionine, sodium sulphate, and elemental sulphur, Jour. Agr. 
    Science, Vol. 15, pp. 559-569, May, 1956. 
36. de Baca, R. C., and others, Factors affecting weaning weights of cross-bred spring 
    lambs, Jour. Amer. Animal Sciences, Vol. 16, pp. 667-678, August, 1956. 
37. Shollenberger, J. H., and Corinne F. Kyle, Correlation of kernel texture, test 
    weight per bushel, and protein content of hard red spring wheat, Jour. Agr. 
    Res., Vol. 33, No. 12, pp. 1137-1150, Dec. 15, 1927. 
38. Coleman, D. A., H. B. Dixon, and H. C. Fellows, Comparison of some physical and 
    chemical tests for determining the quality of gluten in wheat and flour, Jour. 
    Agr. Res., Vol. 34, No. 3, pp. 241-246, Feb. 1, 1927. 
39. Chatfield, Charlotte, Proximate composition of beef, U. S. Dept. Agr. Circular 
    389, 1926. 
40. Cover, Sylvia, O. D. Butler, and T. C. Cartwright, The relationship of fatness in 
    yearling steers to juiciness and tenderness of broiled and braised steak, Jour. 
    Animal Science, Vol. 15, pp. 464-472, May, 1956. 
41. Wang, Hsi, and others, Extensibility of single beef muscle fibers, Jour. Animal 
    Science, Vol. 15, pp. 97-108, February, 1956. 
42. Smith, Allan N., and Brynmor Thomas, The nutritive value of Calluna vulgaris, 
    IV. Digestibility at three, seven and fourteen years after burning, Jour. Agr. 
    Res., Vol. 47, pp. 468-473, August, 1956. 
43. Gowen, John W., Studies on conformation in relation to milk producing capacity 
    in cattle, Jour. Dairy Science, Vol. III, No. 1, January, 1920; Vol. IV, No. 5, 
    September, 1921. 
43a. Gowen, John W., Conformation and milk yield in the light of the personal 
    equation of the dairy cattle judge, Maine Agr. Expt. Sta. Bul. 314, 1923. 
44. Wolfe, T. K., A biometrical analysis of characters of maize and of their inheritance, 
    Va. Agr. Expt. Sta. Tech. Bul. 26, 1924. 




64. Schoenfeld, William A., Some economic aspects of the marketing of milk and 
    cream in New England, U. S. Dept. Agr. Circ. 16, pp. 24-29, 1927. 
65. Moore, Henry L., loc. cit. 
66. Working, Holbrook, ref. 6. 
67. Working, Holbrook, Factors affecting the price of Minnesota potatoes, Minn. Agr. 
    Expt. Sta. Tech. Bul. 29, 1925. 
68. Waugh, Frederick V., Forecasting prices of New Jersey white potatoes and sweet 
    potatoes, N. J. State Dept. of Agr. Circ. 78, 1924. 
69. Killough, Hugh B., What makes the price of oats, U. S. Dept. Agr. Bul. 1351, 1925. 
70. Smith, Bradford B., The adjustment of agricultural production to demand, Jour. 
    Farm Econ., Vol. VIII, No. 2, pp. 163-165, April, 1926. 
71. Bean, Louis H., Some interrelationships between the supply, price, and 
    consumption of cotton, U. S. Dept. Agr., Bur. Agr. Econ., mimeographed report, 
    April, 1928. 
72. Fox, Karl A., The analysis of demand for farm products, U. S. Dept. Agr. Tech. 
    Bul. 1081, September, 1953. 
72a. Fox, Karl A., Factors affecting the accuracy of price forecasts, Jour. Farm Econ., 
    Vol. XXXV, pp. 323-340, August, 1953. 
72b. Fox, Karl A., Econometric Analysis for Public Policy, Iowa State College Press, 1958. 
73. Smith, Bradford B., Factors affecting the price of cotton, U. S. Dept. Agr. Tech. 
    Bul. 50, 1928. 
74. Haas, G. C., and Mordecai Ezekiel, Factors affecting the price of hogs, U. S. Dept. 
    Agr. Bul. 1440, 1926. 
75. Ezekiel, Mordecai, Two methods of forecasting hog prices, Jour. Amer. Stat. 
    Assoc., Vol. XXII, pp. 22-30, March, 1927. 
76. Hanau, Arthur, Die Prognose der Schweinepreise, Vierteljahrshefte zur 
    Konjunkturforschung, Sonderheft 7, Institut für Konjunkturforschung, Berlin, 
    February, 1928. 
77. Ezekiel, Mordecai, Factors related to lamb prices, Jour. Pol. Econ., Vol. XXXV, 
    No. 2, April, 1927. 
78. Hedden, W. P., and Nathan Cherniack, Measuring the melon market, Prelim. 
    report (mimeographed), U. S. Dept. Agr., Bur. Agr. Econ., in coop. N. Y. City 
    Port Authority, August, 1924. 
79. Kantor, Harry, Factors affecting the price of peaches in the New York City market, 
    U. S. Dept. Agr. Tech. Bul. 115, 1929. 
80. Rauchenstein, E., Economic aspects of the cantaloupe industry, Calif. College of 
    Agr. Bul. 419, 1928. 
81. Hoos, Sidney, and S. W. Shear, Relation between auction prices and supplies of 
    California fresh Bartlett pears, Hilgardia, Vol. 14, No. 5, pp. 233-319, January, 
    1942. 
82. Foytik, Jerry, Characteristics of demand for California peaches, Hilgardia, Vol. 20, 
    No. 20, April, 1951. 
83. Mehren, G. L., and H. E. Erdman, An approach to the determination of 
    intraseasonal shifting of demand, Jour. Farm Econ., Vol. 28, No. 2, pp. 587-596, 
    May, 1946. 
84. Fox, Karl A., Factors affecting farm income, farm prices, and food consumption, 
    U. S. Dept. of Agr., Agr. Econ. Res., Vol. 3, No. 3, pp. 65-82, July, 1951. 
85. Fox, Karl A., Changes in the structure of demand for farm products, Jour. Farm 
    Econ., Vol. 37, No. 3, pp. 411-428, August, 1955. 
86. Waite, Warren C., and Harry C. Trelogan, Agricultural Market Prices, 2d edition, 
    440 pp., John Wiley and Sons and Chapman and Hall, 1951. 
87. FAO Commod. Series No. 22, pp. 60-74, September, 1952. 




113. Tiedjens, V. A., W. D. Whitcomb, and R. M. Koon, Asparagus and its culture, 
    Mass. Agr. College Extens. 49, April, 1929. 
114. Howe, Charles B., Some local market price characteristics which affect New Jersey 
    egg producers; factors affecting the retail prices of eggs, N. J. Agr. Expt. Sta. 
    Bul. 150, 1927. 
115. Benner, Claude L., and Harry G. Gabriel, Marketing of Delaware eggs, Del. Agr. 
    Expt. Sta. Bul. 150, 1927. 
116. Kuhrt, W. J., A study of farmer elevator operation in the spring wheat area, 
    Series of 1925-26, Part II. Analysis of the variation in the quality factors of 
    the 1925 crop of spring wheat, and the relation of such variation to price 
    received and premiums paid in 1925-26, U. S. Dept. Agr., Bur. Agr. Econ., 
    preliminary report, October, 1927. 
117. Schultz, Henry, The Theory and Measurement of Demand, Univ. Chicago Press, 
    1938. 
118. Shepherd, Geoffrey S., Agricultural Price Analysis, 4th ed., Iowa State College Press, 
    1957. 
119. Thomsen, Frederick Lundy, and Richard Jay Foote, Agricultural Prices, 509 pp., 
    McGraw-Hill, New York, 1952. 
120. Waite, Warren C., and Harry C. Trelogan, Agricultural Market Prices, 2d edition, 
    440 pp., John Wiley and Sons, and Chapman and Hall, New York and 
    London, 1951. 
121. Hearings before the Temporary National Economic Comm., Part 26, Iron and Steel 
    Industry, Exhibit 1416, An analysis of steel prices, volumes and costs: 
    controlling limitations on price reductions, pp. 14,032-82, Washington, 1940. 
122. Wylie, Kathryn H., and Mordecai Ezekiel, The cost curve for steel production, 
    Jour. Pol. Econ., Vol. XLVIII, pp. 777-821, December, 1940. 
123. Dean, Joel, Statistical cost curves in various industries, Econometrica, Vol. VIII, 
    No. 2, pp. 188-189, April, 1940. 
124. Hearings before the Temporary National Economic Comm., Part 26, Iron and 
    Steel Industry, A statistical analysis of the demand for steel, 1919-1938, pp. 
    13,913-13,942, Washington, 1940. 
125. Roos, C. F., and Victor von Szeliski, Factors governing changes in domestic 
    automobile demand, The Dynamics of Automobile Demand, General Motors 
    Corp., New York, 1939. 
126. Derksen, J. B. D., Long cycles in residential building: an explanation, Econometrica, 
    Vol. VIII, No. 2, pp. 97-116, April, 1940. 
127. Koopmans, T., Tanker Freight Rates and Tankship Building, Netherlands Economic 
    Institute, London, 1939. 
128. Chamberlin, Edward, The Theory of Monopolistic Competition, Harvard Univ. 
    Press, Cambridge, 1936. 
129. Ezekiel, Mordecai, Distribution of gains from rising technical efficiency in 
    progressing economies, Amer. Econ. Rev., Vol. XLVII, No. 2, pp. 361-375, 1957. 
130. Douglas, Paul H., and Grace Gunn, The Production Function for American 
    Manufacturing in 1919, Amer. Econ. Review, Vol. 31, pp. 67-80, March, 1941. 
    See also Paul H. Douglas, Theory of Wages, Macmillan Co., 1934. 
131. Opmerkingen naar aanleiding van het rapport, Het Amsterdamse taxivraagstuk, 
    Nederlandsch Economisch Instituut, Rotterdam, August, 1953. 
132. Keasberry, J. E., L. H. Klaassen, and J. Koopman, De invloed van de belastingdruk 
    op het aantal personenauto's, Wegen, de Vereniging Het Nederlandsche 
    Wegencongres, December, 1954. 
133. Girshick, Meyer, and Ruth O'Brien, Children's body measurements for sizing 
    garments and patterns, U. S. Dept. Agr. Misc. Pub. 365, 1940. 




159. Siegel, Sidney, Nonparametric Statistics for the Behavioral Sciences, McGraw-Hill 
    Series in Psychology, pp. 312, 1956. 
160. Stephenson, William, The Study of Behavior, Q-Technique and its Methodology, 
    Univ. of Chicago Press, pp. 376, 1953. 
161. Spearman, C., The factor theory and its troubles. I. Pitfalls in the use of probable 
    errors, Jour. Educ. Psychol., 1932; II. Garbling the evidence, Jour. Educ. 
    Psychol., October, 1933; III. Misrepresentation of the theory, Jour. Educ. 
    Psychol., November, 1933; IV. Uniqueness of G, Jour. Educ. Psychol., 
    February, 1934; V. Adequacy of proof, Jour. Educ. Psychol., April, 
    1934. 
162. Thurstone, L. L., The Vectors of Mind, Multiple-Factor Analysis for the Isolation 
    of Primary Traits, Univ. Chicago Press, 1935. 
163. Gelfand, Leonard, Bruce Quarrington, Harley Wideman, and Jean Brown, 
    Inter-judge agreement on traits rated from the Rorschach, Jour. Consult. 
    Psychol., Vol. 18, No. 6, 1954. 
164. Mighell, R. L., and R. H. Allen, Supply schedules: "long-time" and "short-time," 
    Jour. Farm Econ., Vol. XXII, No. 3, 1940. 
165. Allen, R. H., Erling Hole, and R. L. Mighell, Supply responses in milk production 
    in Cabot-Marshfield, Vermont, U. S. Dept. Agr. Tech. Bul. 709, 1940. 
166. Jensen, Einar, Determining input-output relationships in milk production, U. S. 
    Dept. Agr. Farm Mgt. Reports 5, January, 1940. 
167. Ezekiel, Mordecai, A check on a multiple correlation result, Jour. Farm Econ., 
    Vol. XXII, No. 2, 1940. 
168. Ezekiel, Mordecai, Agricultural situation and outlook work, national and 
    international, FAO Mo. Bul. Agr. Econ. and Stat., Vol. III, No. 6, pp. 18-28, 
    June, 1954. 
169. Colima, David N., The Engineering Applications of Statistics, Industrial Math., 
    Vol. 7, pp. 1-15, 1956. 
170. Het Verband tussen de Temperatuur en het Verbruik van Gas, 10 pp., Nederlandsch 
    Economisch Instituut, Rotterdam, July, 1953. 
171. Ventura, M. Mateus, Velocidade ultra-sônica, parácoro, refração molar e índice de 
    refração, Escola de Agronomia do Ceará Pub. Tech. 6-A, Fortaleza, Brasil, 
    May, 1951. 
172. Maunder, A. H., Size and efficiency in farming, Univ. Oxford, Institute for Res. 
    in Agr. Econ., Occasional Papers, 23 pp., 1952. 
173. Baum, E. L., Earl O. Heady, and John Blackmore, Economic Analysis of Fertilizer 
    Use Data, Iowa State College Press, 218 pp., 1956. 
174. Jensen, Einar, and others, Input-output relationships in milk production, U. S. 
    Dept. Agr. Tech. Bul. 815, 88 pp., 1942. 
175. Ferger, Wirth F., Measurement of tax shifting, economics, and law, Nat. Tax Jour., 
    May, 1940. 

, Rcitt>fttcncmtamfedera}l3;eadmt/NStraiion, }ia} Tax Jfiur, Suns, 1943 

176. Bandeen, Robert A., Automobile consumption, 1940-50, Econometrica, Vol. 25, 
    pp. 239-248, April, 1957. 
177. Breimyer, Harold F., On price determination and aggregate price theory, Jour. 
    Farm Econ., Vol. XXXIX, pp. 676-694, August, 1957. 
178. Heady, Earl O., Robert McAlexander, and W. D. Shrader, Combinations of 
    rotations and fertilization to maximize crop profits on farms in North Central 
    Iowa (an application of linear programming), Iowa Expt. Sta. Res. Bul. 
    439, 20 pp., 1956. 
179. Nerlove, Marc, Estimates of the elasticities of supply of selected agricultural 
    commodities, Jour. Farm Econ., Vol. 38, No. 2, pp. 496-512, May, 1956. 



CHAPTER 26 


Steps in research work, and 
the place of statistical analysis 


Relation of Statistical Analysis to Research. Statistical analysis is
only a tool to be used by the investigator. The analyst must be a worker
in some field, or in several; he cannot use his statistical training except in
analyzing problems, any more than a carpenter can use his skill without
lumber and something to be made. Now that the routine of statistical
analysis has been discussed, and the types of problems to which it may be
applied have been surveyed, it is pertinent to ask just what are the steps in
research work and just where and how statistical analysis fits into the
picture.

The research worker must have an adequate knowledge of the facts,
technical and otherwise, of the field in which he is to work. This knowledge
is usually insured by the fact that in most cases the worker is a
biologist, an economist, a psychologist, or an agronomist first, and a
statistician only secondarily or in addition. When his training has been
primarily in mathematics or statistics, however, the statistician must acquaint
himself thoroughly with the facts and theories of the field involved before
he can expect to do significant and substantial work on applied problems.

Stating the Objective. If adequate acquaintance with the field is
given, the first step in a particular research problem is setting up the
objective of the project. The objective can best be stated in the form of a
direct question, such as "Why did sales of European automobiles in the
United States increase from 1946 to 1958?" The more exact and specific
the question can be made, the more clearly is the field of the investigation
defined. Stating the objective as a question has the important effect of
clarifying the issue, and so insuring that the worker knows what he is
really trying to find out. It has the further effect of instantly challenging
the attention and of instinctively calling forth mental answers which aid in
the next step of the research.








Any project which cannot be stated as a question has
not been clearly defined. Starting out merely "to collect figures on
automobile sales" would not constitute research. Clear formulation of the
question to be answered is an essential prerequisite of good research work.

Developing an Hypothesis. The second step is a deductive analysis
of the question raised to suggest possible answers. This analysis draws on
all the theoretical and practical training and experience the worker has.
In addition, he may study previous work along the same lines, ask questions
of those concerned in the industry, or make brief reconnaissance studies
to decide on the factors which may be involved and to judge of the probable
relationships. This phase of the research should lead to the setting up of a
definite hypothesis as to the elements which will be involved and of the
ways in which they will be related. Thus in the automobile problem, the
hypothesis might be that the relative prices of domestic and imported cars
were one important factor determining the demand; that the relative
economy in gas consumption and repairs was also important; that
foreign cars were used to supplement American cars for shopping and
other short trips, and foreign sports cars were purchased by wealthy young
people; and that the readiness to buy foreign cars was influenced by the
growth of facilities to service them, and by acquaintance with their use by
other people.

The process of developing the hypothesis may be aided by breaking up
the main question to be answered into a number of subquestions, each one
of which may be further broken up. Thus the initial question might be
broken up into such subquestions as “Do purchases vary because of
increasing cheapness of foreign cars? Are purchasers concerned with
price per car or price per horsepower or per pound? Are they influenced
by relative economy in use of gas, tires, and upkeep? Does readiness to
purchase European cars differ between families living in different kinds of
areas, or with different incomes, or with different compositions of the
family? Does the relative use of different cars vary between different
geographical sections of the country?” And so on until the possible
approaches to the problem have been thought out for every phase.

In setting up the hypothesis the investigator should also attempt to
think through the probable nature of the relationships. Thus, should it be
assumed that the influence of relative prices on purchases will be constant,
and independent of other factors, or is the relation likely to change from
time to time through the year, or with high or low levels of employment
and prosperity?

In setting up his hypotheses, the investigator not only should rely on
his own knowledge but also should draw upon all the skill and knowledge
of others who have experience in the same field. This will involve not only



410 Uses and Philosophy of Correlation and Regression Analysis 

a careful study of earlier investigations of the same problem but also
discussions with practical men who are operating in the field to be studied.
Thus the student of automobile sales should talk with automobile dealers
selling at retail and at wholesale, with officials of automobile companies,
and with individuals owning both American and foreign cars, to get their
opinions of the factors involved. This will enable the student to check
his hypothesis against the ideas of businessmen concerned with the same
problem, and to check it with those concerned as consumers; and these
people often may call to his attention elements in the situation which
otherwise he might completely overlook.
Measuring the Factors. Once the hypothesis has been set up, and
the various factors involved have been considered with much care, the
next step is to secure measurements of the various factors. This will
involve deciding whether the data are to be taken from published records
or other secondary sources, or whether they are to be secured first hand.
In the auto problem, facts on imports, sales, and registrations of
automobiles can be secured from official or trade association publications, and
facts about individual car purchasers or owners, about the kinds of cars
they drive, and about their families, might have to be obtained by direct
enquiry. Will they be collected by direct observation, by enumerators, by
schedules, by mail questionnaires? Advantages and disadvantages of each
method, and the problems involved in laying out a record form, defining
the units, securing the records, and checking or editing the reports, are
available in standard statistical textbooks and are not considered here.
In obtaining the basic data it is necessary to decide on the particular
items to be measured to represent the hypothetical factors. Are prices to
be listed retail prices, or net prices actually paid on purchase? Are taxes
and other sales charges to be included? Will the cost of “extras” be
included? Will the allowance for “trade-ins” be taken at face value, or
will the net price be adjusted when a used car taken in trade is valued far
above its current market value? What characteristics of the various types
of cars will be considered: weight, length, speed, or fuel economy, etc.?
What variables will be used to represent the status, location, and other
characteristics of individual families? How will availability of services
and repairs be measured? Consideration must also be given as
to whether a sample will be used or whether (as often happens in
time-series analysis) the entire universe will be covered. If a sample is employed,
will it be of the regression model, or of the correlation model, as discussed
in Chapter 17? The type of sample must be considered with reference to
the use which will be made of the results, and the extent to which it is
intended to generalize from them as to the relations existing in the universe.
Studying the Apparent Relations. After quantitative or qualitative 





(2) Geographic Differences in Location of Sales or Registrations

    % of all registered cars foreign (in counties or other local units)
        = f(proportion of population ...)
        + f(proportion of population suburban)
        + f(income per capita)
        + f(average cars per family)

(3) Between Individual Families

    Number of foreign cars per family
        = f(total number ..., number of adults, average income per capita, jointly)
        + f(location of home: distance from nearest urban center)


Some aspects of these relations might be explored by correlation
analysis, and other aspects might require other types of statistical treatment.
Also, potentially available data on some aspects, such as (1), may cover so
few observations as to preclude any elaborate statistical analysis. The
relations under (3) might need to be studied separately for different types
of cars, such as sports cars, station wagons, and other passenger cars.
Units in Which Variables Are Stated. Once the variables to be
employed are decided on, the next problem is to decide in what units to
state them. In studying land values, for example, the value of a given farm
may be stated as total value, as value per acre of all land, or as value per
acre of improved land. Which one to select depends on what other
variables are included and how they are to be stated. The total value of
the farm might be correlated with the value of the dwelling, the value of
other buildings, the acres in cultivated land, the acres in pasture, etc.
This would tend to show the contribution per acre of each of the acreage
elements and should give a high correlation, since under normal conditions
the value of the farm might be expected to approximate the value of the
buildings plus that of the several tracts of land. In this case the simple or
additive regression equation would be quite appropriate, for it would give

Farm value = value of dwelling + value of other buildings
    + (value per acre of cultivated land)(number acres of cultivated land)
    + (value per acre of pasture land)(number acres pasture land)
    + (value per acre of woodland)(number acres woodland)
    + etc.




the problem or else it would give a spurious result. The factors would
add up to exactly 100 per cent, and after variation in (A) and (B) had been
held constant there would not be any variation left in (C).* Only by
dropping out one of the factors, say (C), would significant results be
secured. The regressions on (A) and (B) would then also show the effect of
(C); for example, the increase in value for each unit increase in (A) would
mean the increase due to substituting one unit of (A) for one unit of (C);
changing the sign would give the effect of substituting one unit of (C) for
one of (A). The same principle would then apply as between (B) and (C),
whereas the increase in the dependent variable for substituting one unit of
(B) for one of (A) would be the difference between the two net regression
coefficients.
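
The point about factors that add to exactly 100 per cent can be illustrated with a small sketch. The percentages, the assumed values per unit (2, 5, and 1), and the helper names `normal_equations` and `solve` are all hypothetical, chosen only for illustration; exact fractions are used so that the singular case shows up as an exact zero pivot.

```python
from fractions import Fraction

def normal_equations(rows, y):
    """Return the product-sum matrix X'X and vector X'y in exact arithmetic."""
    k = len(rows[0])
    xtx = [[sum(Fraction(r[i] * r[j]) for r in rows) for j in range(k)]
           for i in range(k)]
    xty = [sum(Fraction(r[i] * v) for r, v in zip(rows, y)) for i in range(k)]
    return xtx, xty

def solve(a, b):
    """Gauss-Jordan elimination; returns None when the system is singular."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return None
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [row[-1] for row in m]

# Hypothetical percentages in factors (A), (B), (C); each row adds to 100.
shares = [(50, 30, 20), (20, 50, 30), (30, 20, 50), (60, 10, 30), (10, 60, 30)]
# Assumed value: 2 per unit of (A), 5 per unit of (B), 1 per unit of (C).
value = [2 * a + 5 * b + 1 * c for a, b, c in shares]

# All three factors plus a constant: the normal equations have no unique solution.
full = solve(*normal_equations([(1, a, b, c) for a, b, c in shares], value))

# Dropping (C): each coefficient now measures substitution against (C).
const, b_a, b_b = solve(*normal_equations([(1, a, b) for a, b, _ in shares], value))
print(full, const, b_a, b_b)
```

Under these assumptions the coefficient on (A) is the gain from substituting one unit of (A) for one of (C), and the difference between the two coefficients is the gain from substituting (B) for (A), as the text describes.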

After the variables to be examined and the nature of the regression
function to be used have been decided upon, at least tentatively, it is
necessary to decide whether curves are to be fitted. If mathematical
regressions are to be used, this involves deciding what form of equation
is to be used. (Note pages 70 to 80 of Chapter 6, and 205 to 210 of
Chapter 14.) If curves are to be fitted by one of the graphic methods,
limiting conditions to be applied in fitting the curves must be worked out,
in the light of the hypotheses stated and of the technological and other
knowledge of the relations. (See Chapter 6, and Chapter 14, pages 211
to 213.)

Steps in Carrying Through the Computations. After the variables
and the form of the equation for the statistical analysis have been decided
upon, the next step is actually carrying through the computation. This
involves "coding" the numerical values of the variables, calculating the
extensions, setting up and solving the normal equations, and calculating
the standard error of estimate, the coefficient of multiple correlation, and
the standard errors for the regression coefficients. Then if curvilinear
regressions are desired or found necessary, they will be determined by
mathematical or graphic methods. (A reconnaissance study by the
short-cut graphic method is often useful as a preliminary test before such further
work.) After the final curves are determined, the standard error of
estimate for the curvilinear regression and the index of multiple correlation
are computed. If joint functions are suspected, the residuals are grouped
with respect to two or more variables, or studied with respect to compound
variables, and a joint function is fitted graphically or algebraically, as
found appropriate. As a final step, the standard error of each of the
regression coefficients should be computed and indicated on the regression

* For an extended mathematical treatment of this problem, see Ragnar Frisch,
Statistical confluence analysis by means of complete regression systems, Oslo University
Okonomiske Institutt Publikasjon 5, 1934.
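
The linear part of the computing routine described above (extensions, normal equations, standard error of estimate, multiple correlation, and standard errors of the coefficients) might be sketched as follows. The eight observations are hypothetical, not taken from the text, and the two-variable normal equations are solved by determinants only to keep the sketch short.

```python
import math

# Hypothetical observations: X1 dependent, X2 and X3 independent.
X1 = [12.0, 15.0, 11.0, 19.0, 17.0, 22.0, 20.0, 25.0]
X2 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
X3 = [5.0, 3.0, 8.0, 2.0, 6.0, 1.0, 4.0, 2.0]

def mean(v):
    return sum(v) / len(v)

def dev(v):
    """Departures of each observation from the mean of its series."""
    m = mean(v)
    return [t - m for t in v]

def psum(a, b):
    """Sum of products (an 'extension' total)."""
    return sum(p * q for p, q in zip(a, b))

n, m = len(X1), 3                      # m = constants fitted (a, b2, b3)
x1, x2, x3 = dev(X1), dev(X2), dev(X3)

# Normal equations in departures from the means:
#   (sum x2^2) b2 + (sum x2x3) b3 = sum x1x2
#   (sum x2x3) b2 + (sum x3^2) b3 = sum x1x3
s22, s23, s33 = psum(x2, x2), psum(x2, x3), psum(x3, x3)
s12, s13 = psum(x1, x2), psum(x1, x3)
det = s22 * s33 - s23 * s23
b2 = (s12 * s33 - s13 * s23) / det
b3 = (s13 * s22 - s12 * s23) / det
a = mean(X1) - b2 * mean(X2) - b3 * mean(X3)

# Residuals, standard error of estimate, and (unadjusted) multiple correlation.
z = [y - (a + b2 * u + b3 * v) for y, u, v in zip(X1, X2, X3)]
sum_z2 = psum(z, z)
S = math.sqrt(sum_z2 / (n - m))        # adjusted for the constants fitted
R = math.sqrt(1 - sum_z2 / psum(x1, x1))

# Standard errors of the coefficients, from the inverse of the product-sum matrix.
c22, c33 = s33 / det, s22 / det
se_b2, se_b3 = S * math.sqrt(c22), S * math.sqrt(c33)
print(round(b2, 4), round(b3, 4), round(S, 4), round(R, 4))
```

The curvilinear and joint-function steps mentioned in the text are not covered by this sketch.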




from the height of the water, he would find his forecast sadly in error
if he made it for another day when the water was high because of a flood,
or when the moon was in a different phase. There is no direct causal
relation between the two phenomena, yet there is real correlation between
them because they both are influenced, though very remotely, by the
same sequence of cosmic events. The rising and the setting of the sun
have a very definite influence on the movements of persons and therefore
on the flow of traffic, whereas the rising and the setting of the moon
likewise have a definite influence on the height of the water. Washington
has so low an elevation that the Potomac River has a definite ebb and
flood of tide. There is a certain specific though complex relation between
the rising and setting of the sun and of the moon, changing constantly
from day to day. This illustrates a case in which real and significant
correlation between two variables reflects relation to a common factor
or factors, yet gives no inference as to direct causal connections. Many
similar cases are met with in practical work in which the correlation
between two variables is due to both being influenced by common causes
although neither may in any conceivable way influence the other. This
illustrates again the need for clear, logical thinking and for a technological
basis for the interpretation of the statistical results, which measure the
relationships, but of themselves tell nothing of cause or effect.
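
A small simulation may make the common-cause case concrete. Everything in it is assumed for illustration: the series named `traffic` and `water` are artificial, each built from one shared random factor plus its own independent disturbance, so that neither influences the other.

```python
import math
import random

def correlation(u, v):
    """Simple coefficient of correlation between two series."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((p - mu) * (q - mv) for p, q in zip(u, v))
    suu = sum((p - mu) ** 2 for p in u)
    svv = sum((q - mv) ** 2 for q in v)
    return suv / math.sqrt(suu * svv)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
common = [rng.gauss(0, 1) for _ in range(500)]       # hypothetical common cause
traffic = [c + rng.gauss(0, 0.5) for c in common]    # depends on the common cause only
water = [c + rng.gauss(0, 0.5) for c in common]      # depends on the common cause only

r_gross = correlation(traffic, water)
# Remove the (here known) influence of the common cause from each series.
r_net = correlation([t - c for t, c in zip(traffic, common)],
                    [w - c for w, c in zip(water, common)])
print(round(r_gross, 2), round(r_net, 2))
```

The gross correlation is high, while the correlation remaining after the common factor is removed is close to zero. In practical work the common factor is seldom known this exactly.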
Statement of Results of Correlation and Regression Analysis. Having
completed the statistical analysis, the next step is to translate the statistical
results to an intelligible nontechnical statement. This may go only so
far as simple regression charts or estimating tables of the type shown at
the end of Chapter 14, or of carefully worked out pictorial statements
such as shown in Figure 25.3. After the results are reduced to intelligible
form (intelligible, that is, at least to the investigator), they should be
carefully compared with the original hypothesis. If hypothesis and the
statistical results do not agree, the hypothesis must be carefully examined
to see if it may logically be restated so as to be consistent with the facts
as found, and the analysis must be studied to see if there are any loopholes
in the way the facts are stated, or in the way the problem has been worked
through, which may be responsible for the results. (The preliminary
results, next to last paragraph, page 442, reflect such misstatement of the
variables.) If the hypothesis and results are found to be consistent, or
if, without doing violence to either, they can be brought into reasonable
agreement, the research may be regarded as completed. If such agreement
is not obtained, the results may be announced as actual observations
inconsistent with what was expected and subject to further study or
independent checks before being accepted as scientific conclusions.
Finally, if forecasts of future events or estimates for new observations



APPENDIX 1 


Glossary and important equations


Glossary 

The Greek and Roman letters used as symbols in this text, and the
most important of the other symbols, are as follows:

M_x (Roman) = arithmetic mean of X.
σ (small sigma) = standard deviation in the universe.
Σ (capital sigma) = sum of the items specified.
n (Roman) = number of observations in a sample.
b (Roman) = coefficient of regression.
f( ) (Roman) = function of the variable in the parentheses.
s (Roman) = standard deviation in a sample.
r (Roman) = coefficient of correlation in a sample.
i (Roman) = index of correlation (curvilinear relations).
S (Roman) = standard error of estimate.
m (Roman) = number of constants in the regression equation.
z (Roman) = residual, or difference between observed and
        estimated values of a dependent variable.
R (Roman) = coefficient of multiple correlation.
β (small beta) = “beta” coefficient of regression, in terms of
        standard deviation units; also universe coefficient of regression.
I (Roman) = index of multiple (curvilinear) correlation.
η (small eta) = correlation ratio.
θ (small theta) = function of (used here for the Bruce adjustment
        function).
Δ (capital delta) = arbitrary symbol.
π (small pi) = arbitrary symbol.
Φ (capital phi) = function of.
X, Y (Roman) = variables, as observed.
x, y (Roman) = variables, in terms of departures from their means.
d (Roman) = coefficient of determination.
k (Roman) = coefficient of alienation; also (eq. 17.9) number of
        variables in curvilinear multiple regression.
ρ (Greek rho) = coefficient of correlation in the universe.
m (Roman) = number of variables in a linear multiple regression
        study.

List of Important Equations 

For convenience in referring to the most important of the equations
which are introduced from time to time in the text, all numbered equations
are repeated here in numerical order.


11 

(i.i) 

X- At ^ = x 

(1.2) 

Mean deviation *= 

n 

(1.3) 

II 

(1.4) 

!. = J M; 

' V n 

(1.5) 

jZ{(Pr) r£(<//^i2 c* 

■V « L /I J 12 

(1.6) 


(2.1) 

, _ 

“v n — I 

(2.2) 

, - nMl 

' -V n- 1 

(2.3) 


s — 


(2.4) 





( 7 , I 


y==a-^hX I<_ 1 . 

, i:(A' U 

b = — :— — : — t j c -> i 

If A'-) - rAMX- 

a — — hSf, (5 

i:(A']') - 

l'(A'=) -niAQ- =l'(^=}j 

f, _ i 

' ~ 3^) - (5.5i 


a =r Af. - /,A/. i 

# *• j 


$Y = a + bX + cX^2$    (6.1)

(With $x$ used for $X - M_x$, $U$ for $X^2$, and $u$ for $U - M_u$, equation (6.1)
becomes $Y = a + bX + cU$. These symbols are used in equations [6.2]
to [6.4], inclusive.)

$(\Sigma x^2)b + (\Sigma xu)c = \Sigma xy$
$(\Sigma xu)b + (\Sigma u^2)c = \Sigma uy$    (6.2)

$a = M_y - b(M_x) - c(M_u)$    (6.3)

$M_x = \dfrac{\Sigma X}{n}, \qquad M_u = \dfrac{\Sigma U}{n}$
$\Sigma x^2 = \Sigma X^2 - nM_x^2$
$\Sigma xu = \Sigma XU - nM_xM_u$
$\Sigma u^2 = \Sigma U^2 - nM_u^2$
$\Sigma xy = \Sigma XY - nM_xM_y$
$\Sigma uy = \Sigma UY - nM_uM_y$    (6.4)

$Y = a + bX + cX^2 + dX^3$    (6.5)

(With $U$ for $X^2$ and $V$ for $X^3$, equation (6.5) becomes $Y = a + bX + cU + dV$.
These symbols are used in equations [6.6] to [6.10], inclusive.)



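$(\Sigma x^2)b + (\Sigma xu)c + (\Sigma xv)d = \Sigma xy$
$(\Sigma xu)b + (\Sigma u^2)c + (\Sigma uv)d = \Sigma uy$
$(\Sigma xv)b + (\Sigma uv)c + (\Sigma v^2)d = \Sigma vy$    (6.6)

$a = M_y - b(M_x) - c(M_u) - d(M_v)$    (6.7)

$\Sigma uv = \Sigma UV - nM_uM_v$
$\Sigma xv = \Sigma XV - nM_xM_v$
$\Sigma v^2 = \Sigma V^2 - nM_v^2$
$\Sigma vy = \Sigma VY - nM_vM_y$    (6.8)

$Y = bX + cX^2$    (6.9)

$\Sigma(X^2)b + \Sigma(XU)c = \Sigma(XY)$
$\Sigma(XU)b + \Sigma(U^2)c = \Sigma(UY)$    (6.10)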
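
As an illustration of how the normal equations for the parabola $Y = a + bX + cX^2$ are used (with $U$ written for $X^2$), the sketch below fits the curve to five hypothetical points chosen to lie exactly on $Y = 1 + 2X + 3X^2$, so that the constants recovered can be checked by eye. The data and the solution by determinants are assumptions made for the example.

```python
from fractions import Fraction

# Hypothetical data lying exactly on Y = 1 + 2X + 3X^2 (chosen for illustration).
X = [Fraction(v) for v in (0, 1, 2, 3, 4)]
Y = [1 + 2 * v + 3 * v * v for v in X]
U = [v * v for v in X]                  # U stands for X^2
n = len(X)

def psum(a, b):
    """Sum of products of two series."""
    return sum(p * q for p, q in zip(a, b))

Mx, Mu, My = sum(X) / n, sum(U) / n, sum(Y) / n

# Product sums corrected to departures from the means, as in equation (6.4).
sxx = psum(X, X) - n * Mx * Mx
sxu = psum(X, U) - n * Mx * Mu
suu = psum(U, U) - n * Mu * Mu
sxy = psum(X, Y) - n * Mx * My
suy = psum(U, Y) - n * Mu * My

# The two normal equations in b and c, as in equation (6.2), solved by determinants.
det = sxx * suu - sxu * sxu
b = (sxy * suu - suy * sxu) / det
c = (suy * sxx - sxy * sxu) / det

# The constant term, as in equation (6.3).
a = My - b * Mx - c * Mu
print(a, b, c)
```

With real observations the points will not lie exactly on the curve, and the same steps give the least-squares values of the constants.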

5® = r = — 

*" ‘ n 

(7.1) 

5= 

/(»> ^ 

(72) 

■aii 

II 

&? 

(7 3) 


(7 4) 

C2 ”-^¥ /U> 

(7 5) 

n — m ^\n — m> 

(7 6) 

II 

(7.7) 

j,- 

»,- = — 

j. 

(7 8) 


(7 9) 





tMi] 

^''(i:( .V*) - n.\/;i:Lf f-) j * ’ ' 


1 

•*. ■ 

; 

•^4 ; 

! 

^ ! 
Hi 


2) 

JI5J 


:::(A'}') - H.u, 

-{^Vl 

(".?) 

;« J, 


W n - n ' 

— f 1 

IK4) 



(9.5) 


.Vj = <7 + h.x. + /-.vV, -r . . . 

X\ 2 = ^7 /U/V^ “f" 3 

Z(^)/a -f* — -Vi^:) I 

4- j 

G = ;\/j - hM. - /V.U;s 
A'l = G + /'jA; 4 - />.? A 3 

= = A', AV 

A'l = <7i.ci 4- (’ViA^ 4- /'istA^ 

Ai = 4- /’i-.s«AA 4- ^>i3t:A'r. 4- /’ur: V, 

A'l = f^i.oiij 4* 4" ^ncir-Aa ~ ^’uca-'A* t ^’ji;mAa 

S(4)/7,. m 4- i: A*r3>/;,3 „ 4- , - =- i 

S(x*r3)^7,;3, 4- -{33)S3 ;i 4* r.i “ | 

:^{x^r^)b^.^, 4- 4- or ^ j 

fli^i = A^i — /’jijjA/; — ~ 


(Sfi.U 

(il.l) 

ni,:) 

{!1.5) 

(n.-i) 

UL5i 
( 11 A 
(11 “} 
(UX) 

(IK9) 

(H.iU) 





“C^)^1!SU 

+ i^(a-,ar3)*is,tt + SCxsT,)*!! ju 





S(VJ 

• (11.11) 

2(TjXs)&jtju 

+ “(*3)^13-*is sss 



+S(x3rs)6,it)i = 



Etc 






^15 231 

i (11.12) 


#» «» 


(12 1) 


(£(^5) - [4i!JI .(£*!»!) + 4iJ!1 

1 


\ + +4ib,ss 

*J1 

J. (12 2) 

•S?S3I m ~ 

n — »i 



x; = 

OlJlj + + ^Utt 

,^4 

(12 3) 


^ItM = •^ 


(12 4) 


(^t23( ih(^ 1*!) ■*■ ^13 ti + • • 

■j 


1 +^J».58 (r»-l)(2irjX„) 


^issi « * 



— (12 5) 




(12 6) 


^IJ3» M ~ ^(^ ~ ^1^1 m) 


(12 7) 


.2 , * ~ 231 

'^»«23 * 1 pi 

• 23 


(12 8) 


^1 


(129) 

SA-, + SA-j + SAT, + EX, = S(r„) 


(13 1) 

ATi + M, 

! + ^^3 "V' ^^4 — ^^0 


(13 2) 

I(Xl) + 

2:(XiA',) + ^X^X,) + ECA'iA',) = 

: rCA-iS, 

) (13 3) 


= a' +Am +A(^i) + . . . (HI) 

jr, = a + i,.y, + i.(J?) + 4,Ar, + tj(J®l 

(14 2) 

+ i.,y, + 4,(A'') I 




A'l = fl + /'i(.V;) 4- -f fy .(XJ) -f />.! -j. f ,Vi 

+ MAC) -f /'.(.V,) 4. 4. 

= «i.23t -f /^’(A;) +/t'(AV) 4'./V<AV) 

“/;( AVi] 

t/j ■— -‘Ijj **“ 

"" // 

=' = A', - AT 

Aj — -^iCAj) =^^Aj) — -f A/j 

a-; = r2(x,) =/;(A';) - A/,,,. 

A'/ = + MAV) -f . j. 

Ai./( 2 .n X) = +y^(A;) -}■/:>( X;) -r . . . -h/)(At> 

1) = A', - X[ 

^l.fd.O X) — 

AC = « + -h luX! -f luX, -r KX^t 


.o 

r.Aw,... 

,1) 

= 41(1 

— 

/(j .3'. 



c'^ 


/ ” \ 

f./(C,a.... 

.i) 

— 4i./(j 


■ 1;; — ml 

/- 

..1 

= I _ 

57 

JW J ^ f * . • ' 




■O' 

h.fixz,... 

X) 

= /Ha, 


.X 

((i-a . 

..t 

= '•ku.i 

1*> 



AC = 4* ^n'-S'x'tTiCAi)] 4* e i {/:A A ;A] -f /’jr.'. DiCA*,)] 


4;r4 — X- 




n 


^ /__iiEi— L. 






Vr: 

V = 


5;c 


5f = -Hi: 4- (w 4- rxr 

n 


4BS 

{S4.}, 

( ^••,.^) 

(14.5) 
(1-'. 6) 
(! 4 .-) 
(i-i.M 

(14.9) 
(15,!) 
(i5.2i 
(!5.:4 
(!5.4) 

(15.5) 

(15.6) 

(15.7) 

(15.5) 

(15.9) 

(15.10) 

(15.11) 

(n.i> 

(17,2) 

(17.5) 
1 17.4) 
(17.5) 







*12 



’-'’-'MM;) 

D2 

... ] r^a 



^ISS * 

L 4 

J In — m/ 

1 U «,23 

^23 * 



1 

1 

i 



(17.9) 

-r “ ^jv + (V*)' + 

(19.1) 

1-,, = ■5? !Jl[l + ^ + <■!!»! + <■»*! + <^uA 


+ 2 c 2 jXjr, + 2cJ^T2X^ + 2 c 34 iia: 4 j 

(19.2) 

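$(\Sigma x_2^2)c_{22} + (\Sigma x_2x_3)c_{23} + (\Sigma x_2x_4)c_{24} = 1$
$(\Sigma x_2x_3)c_{22} + (\Sigma x_3^2)c_{23} + (\Sigma x_3x_4)c_{24} = 0$
$(\Sigma x_2x_4)c_{22} + (\Sigma x_3x_4)c_{23} + (\Sigma x_4^2)c_{24} = 0$    (19.3)

$(\Sigma x_2^2)c_{32} + (\Sigma x_2x_3)c_{33} + (\Sigma x_2x_4)c_{34} = 0$
$(\Sigma x_2x_3)c_{32} + (\Sigma x_3^2)c_{33} + (\Sigma x_3x_4)c_{34} = 1$
$(\Sigma x_2x_4)c_{32} + (\Sigma x_3x_4)c_{33} + (\Sigma x_4^2)c_{34} = 0$    (19.4)

$(\Sigma x_2^2)c_{42} + (\Sigma x_2x_3)c_{43} + (\Sigma x_2x_4)c_{44} = 0$
$(\Sigma x_2x_3)c_{42} + (\Sigma x_3^2)c_{43} + (\Sigma x_3x_4)c_{44} = 0$
$(\Sigma x_2x_4)c_{42} + (\Sigma x_3x_4)c_{43} + (\Sigma x_4^2)c_{44} = 1$    (19.5)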
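
The three sets of equations for the $c$ values share the same left-hand product sums and differ only in which equation has 1 on the right. A minimal sketch of their solution follows, using the product sums in departures from the means as read from Table A2.2 of Appendix 2; the helper `solve` and the use of exact fractions are assumptions made for the example, not the computing routine of the text.

```python
from fractions import Fraction

def solve(a, b):
    """Gauss-Jordan elimination in exact arithmetic; None if singular."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return None
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [row[-1] for row in m]

# Product sums in departures from means, as read from Table A2.2
# (rows and columns in the order x2, x3, x4).
sums = [[Fraction(v) for v in row] for row in (
    ("612.77", "2875.16", "-25.38"),
    ("2875.16", "44879.23", "2720.92"),
    ("-25.38", "2720.92", "1093.70"),
)]

# One set of three equations per right-hand side (1,0,0), (0,1,0), (0,0,1).
identity = [[Fraction(int(i == j)) for j in range(3)] for i in range(3)]
c = [solve(sums, rhs) for rhs in identity]   # c[0] = (c22, c23, c24), and so on

print([[float(v) for v in row] for row in c])
```

Because the product-sum matrix is symmetrical, the solution satisfies $c_{23} = c_{32}$, $c_{24} = c_{42}$, and $c_{34} = c_{43}$, so only six distinct values need be found.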






" — C- 

r:- 


1 4- _ 4- 4. 

/I 


on condition thnt (c^c^) — c^. — c^x etc. 



A- 

S7 



U1 

J ■ 


UO T) 


cf'.r) 


A', =/(A2. X^) 

A'l = 0 + cA's + gCA^A 3) 

A'j = f? + eA'a + gi^z^'o) + ^A;) 

A' =/2.3(A'2, A' 3 ) +/4(A',) 

A',=/2.3 .u(A'' 2 ,^VAV-.-A;) 

A'l = fz.zi^'z' A’a) +/4,s(Ait Aj) + /oCA^) 

A"i = fl + ZjoA 2 + ^3 A'a + ^ •■’ 

+ b,VYi +bsVYY + b.vTxl + h,,\ 'XzX, 

/V[nn(A /o^)] - r.[M,)- 

„ ^5% - IKnoA/^) - 

j:- = — — ~ "■ 


(2i.n 
(21.:) 
( 21 ..') 
(2!. A) 
( 21 . 5 ) 
{21.^-} 


( 21 . 7 ) 


( 22 . 1 ) 


ts - 


-r'/ 


S;.,(2:^) 


Q = ai + IhP + u 
Q = 02 + b^P + t' 


C?.!) 

(2?.2j 

c-i.n 

(2-.2i 

(2-i3) 


q = bip + If 


(2". 2) 





g a b.j>+r 

(24 5) 

V — u 

(24 6) 

II 

1 1 
.?■ 

e 

(24 7) 

Zl(i.t>-i^X<’-“)l 

1/ Uv - u)* 

(24 8) 

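$\Sigma(x_2^2)c_{22} + \Sigma(x_2x_3)c_{23} + \Sigma(x_2x_4)c_{24} = 1$
$\Sigma(x_2x_3)c_{22} + \Sigma(x_3^2)c_{23} + \Sigma(x_3x_4)c_{24} = 0$
$\Sigma(x_2x_4)c_{22} + \Sigma(x_3x_4)c_{23} + \Sigma(x_4^2)c_{24} = 0$    (A2.1)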


I, ~ ■^1 

= CA2 2) 

*r, *“ '^U3i|^ + ^ ^‘^2''('^i ~ ^fj)* + cJjtA's — Afj)* 

+ c'M - f>Uf + 2cUXt - A/,) 

X {X^ - A/s) + 2f:,(Arj - A/,)(A'4 - A/4) 

+ 2c;,(A'3-A/aKA'4-A/,)jJ 






Appendix 2 


Table A2.1

Calculation of Extensions, Using the Check Sum

          Variables                            Extensions with X2
  X2     X3     X4     X1     ΣX       X2²     X2X3     X2X4     X2X1     X2ΣX
   0    136    106    103    345         0        0        0        0        0
   1    140    103    108    352         1      140      103      108      352
   2     86    108    102    298         4      172      216      204      596
   3    115    102    111    331         9      345      306      333      993
   4    115    111     95    325        16      460      444      380    1,300
  12    161     91    109    373       144    1,932    1,092    1,308    4,476
  13    235    109    118    475       169    3,055    1,417    1,534    6,175
  14    304    118    123    559       196    4,256    1,652    1,722    7,826
  15    224    123    108    470       225    3,360    1,845    1,620    7,050
  16    185    108    100    409       256    2,960    1,728    1,600    6,544
  17    108    100     88    313       289    1,836    1,700    1,496    5,321
  18    193     88    109    408       324    3,474    1,584    1,962    7,344
  19    175    109    103    406       361    3,325    2,071    1,957    7,714
 134  2,177  1,376  1,377  5,064     1,994   25,315   14,158   14,224   55,691

          Extensions with X3               Extensions with X4        Extensions with X1
    X3²     X3X4     X3X1     X3ΣX       X4²     X4X1     X4ΣX        X1²     X1ΣX
 18,496   14,416   14,008   46,920    11,236   10,918   36,570     10,609   35,535
 19,600   14,420   15,120   49,280    10,609   11,124   36,256     11,664   38,016
  7,396    9,288    8,772   25,628    11,664   11,016   32,184     10,404   30,396
 13,225   11,730   12,765   38,065    10,404   11,322   33,762     12,321   36,741
 13,225   12,765   10,925   37,375    12,321   10,545   36,075      9,025   30,875
 25,921   14,651   17,549   60,053     8,281    9,919   33,943     11,881   40,657
 55,225   25,615   27,730  111,625    11,881   12,862   51,775     13,924   56,050
 92,416   35,872   37,392  169,936    13,924   14,514   65,962     15,129   68,757
 50,176   27,552   24,192  105,280    15,129   13,284   57,810     11,664   50,760
 34,225   19,980   18,500   75,665    11,664   10,800   44,172     10,000   40,900
 11,664   10,800    9,504   33,804    10,000    8,800   31,300      7,744   27,544
 37,249   16,984   21,037   78,744     7,744    9,592   35,904     11,881   44,472
 30,625   19,075   18,025   71,050    11,881   11,227   44,254     10,609   41,818
409,443  233,148  235,519  903,425   146,738  145,923  539,967    146,855  542,521
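
The working of the check sum in Table A2.1 can be sketched as follows. The thirteen observations are those read from the table; the function names are introduced only for this illustration. For each variable, the total of its extension with the check-sum column ΣX must equal the total of its extensions with all four variables.

```python
# Observations of Table A2.1 (columns X2, X3, X4, X1), as read from the table.
X2 = [0, 1, 2, 3, 4, 12, 13, 14, 15, 16, 17, 18, 19]
X3 = [136, 140, 86, 115, 115, 161, 235, 304, 224, 185, 108, 193, 175]
X4 = [106, 103, 108, 102, 111, 91, 109, 118, 123, 108, 100, 88, 109]
X1 = [103, 108, 102, 111, 95, 109, 118, 123, 108, 100, 88, 109, 103]

columns = {"X2": X2, "X3": X3, "X4": X4, "X1": X1}
check = [sum(row) for row in zip(X2, X3, X4, X1)]   # the check-sum column

def product_sum(a, b):
    """Total of the extensions of one series with another."""
    return sum(p * q for p, q in zip(a, b))

def extensions_with(name):
    """Totals of all extensions with one variable, plus its check-sum extension."""
    base = columns[name]
    totals = {other: product_sum(base, col) for other, col in columns.items()}
    totals["check"] = product_sum(base, check)
    return totals

ext_x2 = extensions_with("X2")
# The check-sum total must equal the total of the four product sums.
assert ext_x2["check"] == ext_x2["X2"] + ext_x2["X3"] + ext_x2["X4"] + ext_x2["X1"]
print(ext_x2)
```

The table itself records each cross product only once (for example, X2X3 appears under the extensions with X2 and is not repeated under X3), which is why the earlier columns must be brought in when the later check sums are verified.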







Table A2.2

Calculation of Product Sums Corrected to Departures from Means,
with Check Sum

                              X2           X3           X4           X1           ΣX    Line
Sums                         134        2,177        1,376        1,377        5,064      1
Means                   10.30769    167.46154    105.84615    105.92308    389.53846      2
Extensions with X2      1,994.00    25,315.00    14,158.00    14,224.00    55,691.00      3
Corrections             1,381.23    22,439.84    14,183.38    14,193.69    52,198.14      4
Extensions with x2        612.77     2,875.16       -25.38        30.31     3,492.86      5
Extensions with X3                 409,443.00   233,148.00   235,519.00   903,425.00      6
Corrections                        364,563.77   230,427.08   230,594.54   848,025.23      7
Extensions with x3                  44,879.23     2,720.92     4,924.46    55,399.77      8
Extensions with X4                              146,738.00   145,923.00   539,967.00      9
Corrections                                     145,644.30   145,750.15   536,004.90     10
Extensions with x4                                1,093.70       172.85     3,962.10     11
Extensions with X1                                           146,855.00   542,521.00     12
Corrections                                                  145,856.08   536,394.48     13
Extensions with x1                                               998.92     6,126.52     14



gives the values which are entered in line 5. These values are the extensions,
expressed as departures from the means.
In column $X_3$, for example, the entry in line 3 is $\Sigma X_2X_3$; and the entry
in line 4 is $\Sigma X_3M_2$.

The entry in line 5, then, is $\Sigma X_2X_3 - \Sigma X_3M_2$
$= \Sigma X_2X_3 - nM_2M_3$

$= \Sigma x_2x_3$

Again, the values in the first four columns add to the same as the value
in the check-sum column, verifying the work.
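
The correction of line 3 to line 5 can be sketched in a few lines. The sums and extensions are those read from Tables A2.1 and A2.2; the dictionary layout is an assumption made for the example. Each correction is the column sum multiplied by the mean of $X_2$, which is the same as $n$ times the product of the two means.

```python
n = 13
# Line 1 (sums) and line 3 (extensions with X2), as read from the tables.
sums = {"X2": 134, "X3": 2177, "X4": 1376, "X1": 1377, "check": 5064}
raw = {"X2": 1994, "X3": 25315, "X4": 14158, "X1": 14224, "check": 55691}

mean_x2 = sums["X2"] / n                                # line 2, column X2
corrections = {k: sums[k] * mean_x2 for k in raw}       # line 4
corrected = {k: raw[k] - corrections[k] for k in raw}   # line 5

# The first four columns of line 5 should add to the check-sum column.
total = sum(corrected[k] for k in ("X2", "X3", "X4", "X1"))
print({k: round(v, 2) for k, v in corrected.items()})
print(round(total, 2), round(corrected["check"], 2))
```

Small differences in the last decimal from the printed table arise because the table works with means rounded to five decimal places.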
The rest of the table is entered in similar fashion. Lines 6, 9, and 12
are the extensions with $X_3$, $X_4$, and $X_1$, from Table A2.1. Lines 7, 10,
and 13 are the values in the corresponding columns of line 1, multiplied
by $M_3$, $M_4$, and $M_1$ respectively (from line 2). Lines 8, 11, and 14,
obtained by subtracting the items in lines 7, 10, and 13 from those in
6, 9, and 12, show the values corrected for departures from the means.
In verifying the sum of the other entries in line 8 by the check sum,
the item $\Sigma x_2x_3$ must be included, from column $X_3$, line 5, before comparing
with the check sum; in checking line 11, $\Sigma x_2x_4$ and $\Sigma x_3x_4$, from column
$X_4$, lines 5 and 8, must be included; and in checking line 14, the values
$\Sigma x_1x_2$, $\Sigma x_1x_3$, and $\Sigma x_1x_4$, from column $X_1$, lines 5, 8, and 11, must all be



$X_4$ and $X_1$, and comparing the sum with the check sum in column $\Sigma X$.
The three values add to 39,011.02, agreeing to 0.01 with the check sum,
39,011.03.

The values in line $\Sigma_2$ are next divided by the value in column $X_3$ with
its sign changed (-31,388.78). The quotients are entered as line II'.
Again the check sum verifies the computation.

The values from line 11, Table A2.2, are then entered as line III, beginning
with column $X_4$. (Again disregard the figures in parentheses.) Line I
is multiplied by the value in column $X_4$ of line I' (0.04142), and the products
entered in the corresponding columns below line III; and line $\Sigma_2$ is
multiplied by the value in column $X_4$ of line II', and the products entered
in the corresponding columns in the next line. Line III and the two
following lines are then summed, giving line $\Sigma_3$. The values in line $\Sigma_3$ are
divided by the value in column $X_4$ of that line, with its sign changed.
The quotients are entered as line III'. Again the check sum verifies the
work. The values in line $\Sigma_3$ (before the check sum) add to 577.10, which
agrees to 0.05 with the check sum of 577.05.

The values in lines I', II', and III' of column $X_1$, with the signs changed,
are then entered at the foot of columns $X_2$, $X_3$, and $X_4$ (designated here
$b_{12.34}$, $b_{13.24}$, and $b_{14.23}$). The value at the foot of the $X_4$ column, -0.30944,
is the value for $b_{14.23}$. The item in column $X_4$, line I' (0.04142), is then
multiplied by the last of these values (-0.30944), and the product
(-0.01282) entered in the $X_2$ column; and the item in column $X_4$, line II'
(-0.09048), is also multiplied by -0.30944, and the product entered in
the $X_3$ column. The two entries at the foot of the $X_3$ column are then
added, giving 0.18036 as the value for $b_{13.24}$. The item in column $X_3$,
line I' (-4.69207), is then multiplied by 0.18036, and the product
(-0.84626) entered below the other two entries at the foot of the $X_2$
column. The sum of these three entries, -0.80962, is then the value for
$b_{12.34}$.

The way the check sum works in checking the operations may be seen
by filling in the missing spaces in Table A2.3, as indicated by the entries
enclosed in parentheses. Thus in line II, the first item, 2,875.16, is the
same item as appears in line I, column $X_3$. If when line I had been
multiplied by -4.69207, the operation had included the $X_2$ column also,
the product would have been -2,875.16, or exactly the same as the item
in line I, column $X_3$, with the sign changed. This value, entered below
line II in column $X_2$, exactly cancels the previous value when the two
lines are added, leaving line $\Sigma_2$ still the same.

Similarly, the values -25.38 and 2,720.92, from lines I and II of
column $X_4$, may be entered in parentheses, in columns $X_2$ and $X_3$ of line III.
If the previous operations had been carried out in full, below them would



Methods of Computation

appear 25.38 in column $X_2$ (column $X_4$, line I, times -1), and 119.08
and -2,840.00 ([column $X_4$, line I][-4.69207] and column $X_4$, line $\Sigma_2$,
times -1). When the three lines are totaled to give line $\Sigma_3$, the items
exactly cancel out, as before.

It should be noted that when all the items are entered in each line,
including those in parentheses, the sum of the items in columns $X_2$ to
$X_1$ exactly equals, line by line, the item in column $\Sigma X$. For that reason,
if any error is found when one of the $\Sigma$ lines is reached, the line in which
the error occurred can be determined by adding the items line by line,
and verifying the totals against the individual check sums. To do this
it is not necessary to enter the missing items, as has been done in Table
A2.3 (in parentheses); instead, the items left out can be picked out by
going up the columns for the particular variable concerned. Thus all
the missing terms for line III (extensions for $x_4$) and the next two lines
appear above in the $X_4$ column. Once the location of the missing items
in the previous work has been learned, they can be used to verify the
computations line by line, and any error readily located.

The "back solution" is simply the solution, in regular form, of lines
III', II', and I' for $b_4$, $b_3$, and $b_2$. Thus line III', if written out, is

$$-b_4 = 0.30944$$

Hence $b_4 = -0.30944$, the value at the foot of column $X_4$. Similarly,
line II', written out, becomes

$$-b_3 - 0.09048\,b_4 = -0.15236$$

Substituting the above value for $b_4$, and rearranging,

$$b_3 = 0.15236 - (0.09048)(-0.30944) = 0.15236 + 0.02800$$

These last two values are the same as shown at the foot of column $X_3$;
hence $b_3 = 0.18036$.

Similarly line I', when written out in full, is

$$-b_2 - 4.69207\,b_3 + 0.04142\,b_4 = -0.04946$$

Substituting values for $b_3$ and $b_4$, and rearranging,

$$b_2 = 0.04946 + (0.04142)(-0.30944) + (-4.69207)(0.18036)$$

$$= 0.04946 - 0.01282 - 0.84626 = -0.80962$$

exactly as shown at the foot of column $X_2$.
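The back solution above can be sketched in a few lines of Python. This is only an illustration of the substitution order, using the three reduced equations quoted in the text; the variable names are chosen here and are not the book's notation.

```python
# Back solution of the reduced lines III', II', I' (coefficients quoted in the text).
# III':  -b4                              =  0.30944
# II' :  -b3 - 0.09048*b4                 = -0.15236
# I'  :  -b2 - 4.69207*b3 + 0.04142*b4    = -0.04946

b4 = -0.30944                                   # read directly from line III'
b3 = 0.15236 - 0.09048 * b4                     # line II' rearranged for b3
b2 = 0.04946 - 4.69207 * b3 + 0.04142 * b4      # line I' rearranged for b2

print(round(b2, 4), round(b3, 4), round(b4, 4))
```

Because the sketch carries full floating-point precision instead of rounding each product to five places, the last digit of $b_2$ can differ by one unit from the hand computation.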

Having computed the values of the three regression coefficients, the
final steps are (a) to check those values by substituting them in the last
equation (line III, in full), (b) to compute the coefficient of multiple
correlation, and (c) to compute the constant for the regression
equation. These steps are all shown in Table A2.4.


Table A2.4

Final Steps in Solution of Multiple Correlation Problem

[Table body not legible in this copy. Column (1) holds the three b's:
-0.80962, 0.18036, -0.30944.]


The first operation in Table A2.4 is the final checking of the entire
solution, including the back solution. This is done by substituting the
values found for the b's in the last equation of the normal equations.
For this problem that equation is

$$\Sigma(x_2x_4)\,b_2 + \Sigma(x_3x_4)\,b_3 + \Sigma(x_4^2)\,b_4 = \Sigma(x_1x_4)$$

The values of the 3 b's are entered in column 1 of the table, and the values
of the corresponding coefficients of the unknowns, such as $\Sigma(x_2x_4)$, etc.,
are entered in column 2. The product of each b with its coefficient is then
computed and entered in column 3. These add to 172.87, checking
satisfactorily with the value of $\Sigma(x_1x_4)$, 172.85, as shown at the foot of
column 2.

The computation of the coefficient of multiple correlation, according
to equation (46),

$$R^2_{1.234} = \frac{b_{12.34}\Sigma(x_1x_2) + b_{13.24}\Sigma(x_1x_3) + b_{14.23}\Sigma(x_1x_4)}{\Sigma(x_1^2)}$$

is shown in tabular form in columns 4 and 5.
The values $\Sigma(x_1x_2)$, etc., as shown in Table A2.2, lines 5, 8, and 11 of
column $X_1$, and $\Sigma(x_1^2)$, shown in line 14, are entered in column 4 of Table
A2.4. Each product sum is multiplied by the corresponding b, shown in
column 1, and the product entered in column 5. The sum of these products
is then the numerator of the fraction in equation (46). The computation
is then readily completed:

$$R^2_{1.234} = \frac{810.15}{998.92} = 0.8110$$

$R_{1.234} = 0.9006$. With n = 13 and m = 4,

$$\bar{R}^2 = 1 - (1 - 0.8110)\frac{12}{9} = 0.748, \text{ and } \bar{R} = 0.86$$


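The standard error of estimate may also be readily computed. Since

$$\Sigma(x_1^2) = n s_1^2 = 998.92$$

and the portion accounted for by the regression is 810.15, the residual sum
of squares is

$$\Sigma z^2 = 998.92 - 810.15 = 188.77$$

Since there are 13 cases and 3 independent variables,

$$\bar{S}^2_{1.234} = \frac{\Sigma z^2}{n - m} = \frac{188.77}{9} = 20.97$$

and

$$\bar{S}_{1.234} = 4.58$$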
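The arithmetic of the last two steps can be checked with a short Python sketch. The sums of squares are the ones quoted in the text; the adjustment factor (n - 1)/(n - m) is assumed here from the printed result, and the variable names are illustrative only.

```python
import math

explained_ss = 810.15   # sum of each b times its product sum with x1
total_ss = 998.92       # sum of squared deviations of X1
n, m = 13, 4            # observations, and constants fitted (3 b's plus a)

r2 = explained_ss / total_ss
r = math.sqrt(r2)

# Adjusted (bar) coefficient, assuming the usual (n - 1)/(n - m) correction
r2_adj = 1 - (1 - r2) * (n - 1) / (n - m)

# Standard error of estimate from the residual sum of squares
residual_ss = total_ss - explained_ss
s_bar = math.sqrt(residual_ss / (n - m))

print(round(r2, 4), round(r, 4), round(r2_adj, 3), round(s_bar, 2))
```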


The a for the regression equation is next computed. Using equation
(11.10),

$$a_{1.234} = M_1 - b_2M_2 - b_3M_3 - b_4M_4$$

we may arrange the work in tabular order as shown in columns 6 and 7
of Table A2.4. The means, from line 2 of Table A2.2, are entered in
column 6, then multiplied by their respective b's, and the products entered
in column 7. To complete the computation, following equation (11.10),
the sum of this column is then subtracted from the mean of $X_1$:

$$a_{1.234} = 105.92 - (-10.90) = 116.82$$

This completes the computation of all the linear multiple correlation
constants. The results may be summarized:

$$X_1' = 116.82 - 0.810X_2 + 0.180X_3 - 0.309X_4$$

$$\bar{R}_{1.234} = 0.86$$
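As a rough check on columns 6 and 7, the constant can be recomputed from the b's and the means. In this sketch the means of $X_2$, $X_3$, and $X_4$ are taken as 10.31, 167.46, and 105.85, the two-decimal values used later in this appendix; the dictionary layout is an illustration, not the tabular form of Table A2.4.

```python
b = {"X2": -0.80962, "X3": 0.18036, "X4": -0.30944}
means = {"X2": 10.31, "X3": 167.46, "X4": 105.85}   # rounded means, assumed
mean_x1 = 105.92

# Column 7: each mean multiplied by its regression coefficient
products = {name: b[name] * means[name] for name in b}
column_sum = sum(products.values())

# Equation (11.10): a = M1 - (b2*M2 + b3*M3 + b4*M4)
a = mean_x1 - column_sum

print(round(column_sum, 2), round(a, 2))
```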



Table A2.5

Doolittle Solution of Normal Equations for Six Variables

Standard Errors of Partial Regression Coefficients and Standard
Error of an Individual Estimate. The computation of standard errors of
net or partial regression coefficients by equation (17.2), and of standard
errors of an individual estimate, by equation (19.2) and those following,
may be simplified by the following procedure:

$$\begin{aligned}
\Sigma(x_2^2)\,c_{22} + \Sigma(x_2x_3)\,c_{23} + \Sigma(x_2x_4)\,c_{24} &= 1 \\
\Sigma(x_2x_3)\,c_{22} + \Sigma(x_3^2)\,c_{23} + \Sigma(x_3x_4)\,c_{24} &= 0 \\
\Sigma(x_2x_4)\,c_{22} + \Sigma(x_3x_4)\,c_{23} + \Sigma(x_4^2)\,c_{24} &= 0
\end{aligned} \qquad (A2.1)$$

Solve simultaneously to obtain the values for $c_{22}$, $c_{23}$, and $c_{24}$. Then set
up exactly the same set of equations, with $c_{23}$, $c_{33}$, and $c_{34}$ as the unknowns,
and with 0, 1, and 0 to the right of the equal signs, in the first, second, and
third equations, respectively, and solve. Then set up again, with $c_{24}$, $c_{34}$,
and $c_{44}$ as the unknowns, and with 0, 0, and 1 to the right of the equal signs,
and solve again. The standard errors of the regression coefficients may
then be found by the following equations:

$$\begin{aligned}
s_{b_{12.34}} &= \bar{S}_{1.234}\sqrt{c_{22}} \\
s_{b_{13.24}} &= \bar{S}_{1.234}\sqrt{c_{33}} \\
s_{b_{14.23}} &= \bar{S}_{1.234}\sqrt{c_{44}}
\end{aligned} \qquad (A2.2)$$

Except for the values to the right of the equal sign, the coefficients of the
equations are exactly the same as those required to obtain the values of
$b_2$, $b_3$, and $b_4$. For that reason the values for $c_{22}$, $c_{33}$, and $c_{44}$ may
be most readily calculated by introducing as many new columns in the
form of the Doolittle solution (Table A2.3) as there are independent
factors, between the columns for $X_1$ and $\Sigma$. These columns will be

Line        Error b2   Error b3   Error b4   Error b5
(Eq. I)         1          0          0          0
(Eq. II)        0          1          0          0
(Eq. III)       0          0          1          0
(Eq. IV)        0          0          0          1
etc.


These values can be included in the check sum, and the operations
carried through for them just as for the other columns until the "back
solution" to find the b's is reached. Then a separate "back solution" can
be run for each set of "c" values, starting with the values in each "Error"
column just as the back solution to find the b's started with the values in
the $X_1$ column.*
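The idea of re-solving the same normal equations with right-hand sides (1, 0, 0), (0, 1, 0), and (0, 0, 1) can be sketched as follows. The product sums below are hypothetical numbers chosen only for illustration (they are not the values of Table A2.2), and plain Gaussian elimination stands in for the tabular Doolittle layout.

```python
def solve(coeffs, rhs):
    """Solve coeffs * x = rhs by Gaussian elimination without pivoting
    (adequate for the positive-definite matrices of normal equations)."""
    n = len(coeffs)
    aug = [row[:] + [r] for row, r in zip(coeffs, rhs)]   # augmented copy
    for i in range(n):                       # forward ("front") solution
        for j in range(i + 1, n):
            f = aug[j][i] / aug[i][i]
            for k in range(i, n + 1):
                aug[j][k] -= f * aug[i][k]
    x = [0.0] * n
    for i in reversed(range(n)):             # back solution
        s = aug[i][n] - sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = s / aug[i][i]
    return x

# Hypothetical product sums for three independent variables
normal = [[4.0, 2.0, 1.0],
          [2.0, 5.0, 3.0],
          [1.0, 3.0, 6.0]]

# One solution per "Error" column: unit value in position j, zeros elsewhere
c = [solve(normal, [1.0 if i == j else 0.0 for i in range(3)]) for j in range(3)]

for row in c:
    print([round(v, 5) for v in row])
```

The three solutions together form the reciprocal matrix, which is symmetric; that symmetry is what allows some c values to be copied rather than recomputed in the back solutions.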

Table A2.6 shows all the computations necessary to compute all the b's
and c's from the product sums calculated in Table A2.2, except for the
back solution on $X_1$, as shown in the lower section of Table A2.3. Table
A2.6 thus replaces all of Table A2.3, except this last section. In practice,
this back solution would be included in Table A2.6 ahead of the three
back solutions on $c_2$, $c_3$, and $c_4$.

In computing Table A2.6, the work in the c columns is carried out to
two more decimal places than in the other columns. This is necessary
because of the small size of the values involved. It should also be noticed
that in the back solution on $c_3$ only $c_{43}$ and $c_{33}$ are calculated directly.
Since $c_{23}$ is identical with $c_{32}$, the value previously calculated for the latter
is inserted instead. Similarly, the back solution on $c_4$ involves no
additional calculating at all, since $c_{44}$ is copied down (with the sign changed)
from line III', $c_{43}$ is written down for $c_{34}$, and $c_{42}$ for $c_{24}$. Only the
computation by substitution in the check equations is involved. Even that
computation can be omitted for the $c_4$ values, since each of them has
been checked earlier: $c_{42}$ and $c_{43}$ by substitution, and $c_{44}$ by the check sum
in lines $\Sigma_3$ and III'.

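As a result of these computations, the following values are secured:

$$c_{22} = 0.00259, \quad c_{33} = 0.000042, \quad c_{44} = 0.00120$$

Since $\bar{S}_{1.234} = 4.58$, the standard error of the b's may be readily calculated
by equation (A2.2):

$$\begin{aligned}
s_{b_{12.34}} &= 4.58\sqrt{0.00259} = 0.233 \\
s_{b_{13.24}} &= 4.58\sqrt{0.000042} = 0.030 \\
s_{b_{14.23}} &= 4.58\sqrt{0.00120} = 0.159
\end{aligned}$$

The net regression coefficients may then be stated:

$b_{12.34} = -0.810$ (0.233)

$b_{13.24} = 0.180$ (0.030)

$b_{14.23} = -0.309$ (0.159)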
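Equation (A2.2) is a one-line computation once the diagonal c values are known. The sketch below simply repeats the arithmetic with the figures quoted in the text; the dictionary keys are illustrative labels.

```python
import math

s_bar = 4.58                       # standard error of estimate
c_diag = {"b12.34": 0.00259,       # c22
          "b13.24": 0.000042,      # c33
          "b14.23": 0.00120}       # c44

# Equation (A2.2): standard error of each b = S-bar times sqrt(c_jj)
std_err = {name: s_bar * math.sqrt(c) for name, c in c_diag.items()}

for name, se in std_err.items():
    print(name, round(se, 3))
```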

* For an explanation of why this process and equation (A2.2) give the standard
error of the b's, see Note 13 in the second edition of this book. For other uses of the
"c" constants see R. A. Fisher, Statistical Methods for Research Workers, 12th ed., pp.
129-166, Oliver and Boyd, Edinburgh and London, 1954.


Table A2.6

Solution of Normal Equations by the Doolittle Method, to Obtain
Regression Coefficients and Their Standard Errors




































Just as in the illustrations discussed in Chapter 17, some of the net
regression coefficients are much more reliable than are others. If we
assume that the conditions of random sampling are fulfilled, there is some
possibility that the regression for $b_{14.23}$ in the universe from which the
sample was drawn is really positive instead of negative, but there is only a
very slight chance that $b_{12.34}$ is really positive, and it is almost a certainty
that $b_{13.24}$ is really positive, and above 0.1.

The computation of the standard errors of the net regression coefficients,
by the method just presented, is not a difficult one. It should be made an
integral part of every multiple correlation solution, so that not only will
the regression coefficients be obtained, but also the amount of confidence
that can be placed in each value will be determined. Only if that is done
can the regressions be interpreted with confidence.

The computations shown in Table A2.6 also give all the values needed
to estimate the standard error of an individual estimate. Substituting
these values in equation (19.2), and using the value for $\bar{S}^2_{1.234}$ previously
calculated on page 497 (in practice the calculations on that page would all
be made after Table A2.6 was calculated, including the back solution on
$X_1$), we have

$$s^2 = 20.97\left[1 + \tfrac{1}{13} + 0.00259x_2^2 + 0.000042x_3^2 + 0.00120x_4^2 + 2(-0.00020)x_2x_3 + 2(0.00056)x_2x_4 + 2(-0.00011)x_3x_4\right] \qquad (X)$$

The use of this equation may be shown as follows. Suppose we draw a
new observation from the same universe as that from which the original
sample (shown in Table A2.1) was drawn, and the new observation has
values of 18 for $X_2$, 300 for $X_3$, and 90 for $X_4$. After we estimate the
probable $X_1$ value from the regression equation, how much confidence can
we place in that estimate?

The estimated value works out as follows.
The regression equation (from Table A2.4) is

$$X_1' = 116.82 - 0.80962X_2 + 0.18036X_3 - 0.30944X_4$$

$$= 116.82 - 0.80962(18) + 0.18036(300) - 0.30944(90)$$

$$= 128.51$$

Before the values of $X_2$, $X_3$, and $X_4$ for this new observation can be
substituted in equation (X), they must be put in the form of $x_2$, $x_3$, or $x_4$.
Using the means shown in Table A2.2, we calculate

$$\begin{aligned}
x_2 &= X_2 - M_2 = 18 - 10.31 = 7.69 \\
x_3 &= X_3 - M_3 = 300 - 167.46 = 132.54 \\
x_4 &= X_4 - M_4 = 90 - 105.85 = -15.85
\end{aligned}$$



Table A2.7

Doolittle Solution of Normal Equations, to Find Coefficients of
Partial Correlation, for Three Independent Variables












































solution should be arranged m this column order A*. Aj. Aj A, \, 
After the entnes were calculated through the front solution, two baeV. 
solutions could then be run. one leasing out the Aj column, and one the 
A's column Where there are six independent vanabics, a third step could 
be used by repeating the last two steps of the front solution for the third
independent variable to be dropped out, or a complete new solution
could be run with $X_3$ and $X_4$ occupying the last columns before $\Sigma$. In
many-variable problems, various other time-saving combinations can be
worked out by the ingenious computer.

Tables A2.4, A2.7, and A2.8 provide all the values necessary for the
calculation of the partial correlation coefficients, using equation (12.8).
These calculations may be tabled as follows:



Variables     R^2 (1)     1 - R^2 (2)
1.234         0.8110      0.1890
1.23          0.7309      0.2691
1.24          0.0293      0.9707
1.34          0.5580      0.4420

$$r^2_{12.34} = 1 - \frac{1 - R^2_{1.234}}{1 - R^2_{1.34}} = 1 - \frac{0.1890}{0.4420} = 0.5724; \quad r_{12.34} = -0.76$$

$$r^2_{13.24} = 1 - \frac{1 - R^2_{1.234}}{1 - R^2_{1.24}} = 1 - \frac{0.1890}{0.9707} = 0.8053; \quad r_{13.24} = 0.90$$

$$r^2_{14.23} = 1 - \frac{1 - R^2_{1.234}}{1 - R^2_{1.23}} = 1 - \frac{0.1890}{0.2691} = 0.2977; \quad r_{14.23} = -0.55$$

The signs of the partial correlation coefficients are taken from the signs
of the corresponding net regression coefficients, as shown in Table A2.3
or A2.4.
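The tabled calculation can be reproduced with a small helper. The values of 1 - R^2 are those listed above; the function name and the dictionary keys are illustrative only.

```python
# 1 - R^2 for the full regression and for each regression with one variable omitted
one_minus_r2 = {"1.234": 0.1890, "1.23": 0.2691, "1.24": 0.9707, "1.34": 0.4420}

def partial_r2(full, reduced):
    """Squared partial correlation of X1 with the variable missing from `reduced`."""
    return 1 - one_minus_r2[full] / one_minus_r2[reduced]

r2_12_34 = partial_r2("1.234", "1.34")   # X2, holding X3 and X4 constant
r2_13_24 = partial_r2("1.234", "1.24")   # X3, holding X2 and X4 constant
r2_14_23 = partial_r2("1.234", "1.23")   # X4, holding X2 and X3 constant

print(round(r2_12_34, 4), round(r2_13_24, 4), round(r2_14_23, 4))
```

The sign of each partial correlation coefficient is not produced by this calculation; as the text notes, it is taken from the sign of the corresponding net regression coefficient.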

Alternative Methods of Solving Normal Equations. The methods
for solving normal equations and obtaining the various constants necessary
in correlation analysis, which have been presented in Tables A2.3 to A2.8
inclusive, employ the so-called Doolittle method of solving equations,
first developed by Dr. M. H. Doolittle, a computer in the Geodetic
Survey.* His method involved slight modifications of the methods
originally suggested by Gauss, the discoverer of the least-squares technique.
(The solutions shown in Tables A2.7 and A2.8 involve short cuts added
by the senior author.) The $c_{22}$, etc., method of calculating error formulas
(the reciprocal matrix) was also first developed by Gauss, and was revised
by R. A. Fisher. Its further application in calculating the standard errors
of an individual estimate was developed by the late M. A. Girshick of the
U. S. Department of Agriculture, at the senior author's request.

* M. H. Doolittle, Adjustment of the primary triangulation between Kent Island and
Atlanta base lines (Paper No. 3. Method employed in the solution of normal equations
and the adjustment of a triangulation), Report of the Superintendent, Coast and Geodetic
Survey, pp. 115-120, 1878.

A simple short cut in the solution of the normal equations has been
suggested by P. S. Dwyer.* He points out that much of the "front
solution" involves subtracting a series of products from, or adding them
to, a given figure. In Table A2.5, for example, the item that appears in
line $\Sigma_4$ of column $X_5$ is simply the value:

100.0000 + (25.6900)(-0.256900) + (39.2091)(-0.414640) + ( ... )( ... ) = 72.4...

With modern computing machines this value can be computed directly
without clearing the total dial, using the reverse lever whenever the product
is to be subtracted instead of added. This method saves reading off and
entering in the table the values that appear in lines IV-1, IV-2, and IV-3.
Whether this additional operation and possibility of error offset the other
savings each computer can determine for himself.

Using this Dwyer short cut all the way through, the front solution of
Table A2.3 would show only the lines I and I', $\Sigma_2$ and II', and $\Sigma_3$ and III'.
Similarly Table A2.5 would show in the front solution only I and I',
$\Sigma_2$ and II', $\Sigma_3$ and III', $\Sigma_4$ and IV', and $\Sigma_5$ and V'. Various other possible
modifications of the Doolittle solution, all based on the same basic
principle, are shown in Dwyer's article.


Obtaining all Multiple Correlation and Regression Coefficients
by Matrix Solution.

Modern developments in calculation and matrix solution have been
combined in one composite solution to give practically all constants
desired, in a recent publication by Friedman and Foote. Using the
problem presented in Table 13.1, the successive steps in this method are
summarized below.

Obtaining the Augmented Sums of Squares and Cross-Products.
The first step is to compute the "augmented" sums of squares and cross-

* P. S. Dwyer, The solution of simultaneous equations, Psychometrika, Vol. 6, No. 2,
April, 1941.



Table A2.9

Computation of Augmented Sums of Squares and Cross-Products*

[Table body not legible in this copy.]

* The computations were performed with 9 decimal places, of which only 6 appear in
the table; therefore, some of the computations may be slightly in error. Where fewer
decimals are shown than 6, they were all zeros.

† For this example, N = 20.

‡ The $k_ik_j$ value is shown in parentheses, as taken from Table A2.10.

The terms omitted from this row in the table, 554,069 and 279,613.9, are
obtained, respectively, from the first row of the first section and the first
row of the second section of column (3), the column in which the first
written term of the row, 5,170,916, appears. In general, for the ith row of
any section, the omitted terms to the left of any given term, call it m, are
obtained from the ith row of each section of the column in which m appears.

If a discrepancy due to a rounding error should occur, the sum across
the row is considered as the correct figure and the figure originally shown
in the $\Sigma$ column is corrected accordingly. This corrected value is used in
further computations. (For details of this adjustment, see the Friedman-
Foote handbook, page 5.)

Since only a limited number of decimals are shown here, a √ is placed
after all items in the $\Sigma$ column that serve as checks. However, rounding
errors do occur in some of these items. These result in part because the
omitted figures were dropped without rounding. In such cases the wrong
figure in the $\Sigma$ column is crossed out and the sum of the columns is
substituted.

If an error is made in computing the sums of squares and cross-products,
the following method is more efficient than a direct recomputation as a
means of locating the error. Suppose that the checking operation indicates
that an error has been made in obtaining the extensions with $X_1$. Continue
to calculate the extensions with $X_2$ and, if the check for this indicates that
no error has been made, we know that the augmented moment between
$X_1$ and $X_2$ in the first section is correct. Similarly, if the extensions with
$X_3$ check, we know that the augmented moment between $X_1$ and $X_3$ in the
first section is correct. If all other extensions check, the mistake is in the
computation of the sum of the squares for $X_1$. If one of the extensions
does not check, recomputation of the corresponding element in the first
section is indicated. A similar procedure is used if the initial error occurs
in an extension other than with $X_1$.

Adjustments to Make the Sums of Squares Nearly Equal 1. It is a
great convenience in computations to have all the elements on the main
diagonal close to 1. In making this adjustment we are concerned only
with the last row in each section of Table A2.9. A set of values that are
powers of 10, the $k_i$, where i is the variable to which it applies, is chosen
such that when the sum of squares for the variable is multiplied by the
square of the $k_i$ the answer lies between 0.1 and 10. The value $(k_i)^2$ is
referred to as the adjustment factor. The $k_i$ for this example are shown in
the second column of Table A2.10. They are determined in the following
manner: In Table A2.9, note that the sums of squares for $X_1$, $X_2$, and $X_4$,
respectively, lie between 1,000 and 20,000; therefore the adjustment
factor equals $(0.01)^2$ and k equals 0.01 for $X_1$, $X_2$, and $X_4$. For $X_3$, however,



Table A2.11

Calculating Partial and Multiple Regression Measures for

Forward solution, x part

Row                              x1          x2          x3          x4       Sum_x
                                (1)         (2)         (3)         (4)         (5)
(1)                         .763984    -.199536     .360200    -.084324     .840324
(2)                                     .360824     .413180    -.299244     .275224
(3)                                                1.408320     .113360    2.295060
(4)                                                            1.763939    1.493731

(1)                         .763984    -.199536     .360200    -.084324     .840324
(1')  (1)(1.308927935)     1           -.261178     .471476    -.110374    1.099924 √

(2)                                     .360824     .413180    -.299244     .275224
(1)(.261178)                           -.052114     .094076    -.022024     .219474
(2')                                    .308710     .507256    -.321268     .494698 √
(2'') (2')(3.239286061)                1           1.643147   -1.040679    1.602461 √

(3)                                                1.408320     .113360    2.295060
(1)(-.471476)                                      -.169826     .039757    -.396193
(2')(-1.643147)                                    -.833496     .527891    -.812862
(3')                                                .404998     .681008    1.086005
(3'') (3')(2.469147995)                            1           1.681510    2.681510

(4)                                                            1.763939    1.493731
(1)(.110374)                                                   -.009307     .092750
(2')(1.040679)                                                 -.334337     .514822
(3')(-1.681510)                                               -1.145122   -1.826130
(4')                                                            .275173     .275173
(4'') (4')(3.634077471)                                        1           1

Calculation of Partial Coefficients

Row                              x1            x2            x3            x4           Sum
(5)  b_1j = -d_1j/d_11           -1     -1.849845       .851894      -.415085     -2.413036 √
(6)  d_11 d_jj           275.717092   1037.507961    211.617583     60.342871   1585.185507
(7)  d_1j^2              275.717092    943.483898    190.809412     47.504750   1457.515152
(8)  (6) - (7)                    0     94.024063     20.808171     12.838121    127.670355 √
(9)  (8)(.000227)                 0       .021343       .004723       .002914
(10) s_b = sqrt of (9)                    .146092       .068725       .053981
(11) r^2 = (7)/(6)                        .909375       .901671       .787247
(11*) r                                      .954          .950          .887

Calculation of Multiple Coefficients

(12) $R_{1.234} = \sqrt{\dfrac{(16.604731)(.763984) - 1}{(16.604731)(.763984)}} = \sqrt{\dfrac{11.685749}{12.685749}} = .959777$

Calculation of Deadjusted Coefficients

Row     Variable           b          s_b        k_i     k_i' = k_i/k_1
          (1)             (2)         (3)        (4)          (5)
(13)      X1                                     .01
(14)      X2        -1.849845      .146092       .01          1.0
(15)      X3          .851894      .068725       .001          .1
(16)      X4         -.415085      .053981       .01          1.0

(17) Standard error of estimate = .013711/.01 = 1.3711;  a = 54.680000 - (-36.133814) = 90.813814

(18) X1 = 90.813814 - 1.849845 X2 + .0851894 X3 - .415085 X4
                       (.146092)     (.0068725)    (.053981)













































using a variation of the Doolittle method that omits the conventional
back solution.

Steps involved in the forward solution of the Doolittle method are
given here in full detail. Experience demonstrates this as the easiest way
to learn how to carry out these operations. Once the general approach is
learned, many of the computations shown individually in Table A2.11 can
be cumulated directly in the calculating machine. Use is made of all
possible shortcuts of this kind in the so-called abbreviated Doolittle
method. This is the method described by Klein.*

Computations involved in the forward solution of the Doolittle method
are shown in Table A2.11 as follows:

In rows (1)-(4), columns (1)-(4), enter the adjusted augmented moments
computed above. The reader will note that the $X_i$ are listed in numerical
order; in the method used in the usual solution elsewhere in this book,
$X_1$ is placed after the last independent variable. Computations involved
in obtaining Table A2.10 and the adjusted augmented moments should
be carefully checked, as no automatic checks are available for these steps.

Additional columns $I_i$, one for each variable in the analysis, are added
in columns (6)-(9). The makeup of these is obvious from the table.

As an alternative, data shown in the upper section of Table A2.11 can
be recorded directly as the first row of each subsequent section.

In this forward solution we carry two check columns: $\Sigma_x$, column (5),
for that part of the solution concerning the x's, and $\Sigma_I$, column (10), for
that part of the solution concerning the I's. For the upper section of
Table A2.11, that is, rows (1)-(4), these columns are obtained in the
following way: The element in the ith row† of the $\Sigma_x$ column is obtained by
adding together the elements in the ith row of columns (1)-(4), including
the omitted elements. The element omitted in the ith row and jth column
can be found in the jth row of the ith column. For example, the omitted
element in row (4), column (3), is the element in row (3), column (4),
namely, .113360, etc. The element in the ith row of the $\Sigma_I$ column is
obtained by adding the elements in the ith row of columns (6)-(9).
Because of the makeup of the columns, however, each element in these
rows of the $\Sigma_I$ column equals 1. In the computations outlined below,
$\Sigma_x$ and $\Sigma_I$ are treated as additional variables, with all the operations
performed upon them.
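The rule for the check column can be illustrated with a short sketch: only the upper triangle of the moments is written in the table, and each omitted element is looked up in the transposed position. The moments below are the adjusted augmented moments as read from rows (1)-(4) of Table A2.11; the helper names are illustrative.

```python
# Upper triangle of the adjusted augmented moments, keyed by (row, column)
upper = {
    (1, 1): 0.763984, (1, 2): -0.199536, (1, 3): 0.360200, (1, 4): -0.084324,
    (2, 2): 0.360824, (2, 3): 0.413180, (2, 4): -0.299244,
    (3, 3): 1.408320, (3, 4): 0.113360,
    (4, 4): 1.763939,
}

def element(i, j):
    """Element in row i, column j; omitted entries come from row j, column i."""
    return upper[(i, j)] if i <= j else upper[(j, i)]

# Check column: sum across each full row, omitted elements included
check_sum = {i: sum(element(i, j) for j in range(1, 5)) for i in range(1, 5)}

for i, total in check_sum.items():
    print(i, round(total, 6))
```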

Only the second row in the first section and the last two rows in each
succeeding section of the solution are checked. This is done in two parts,

* Lawrence R. Klein, A Textbook of Econometrics, Row, Peterson, and Co., Evanston,
Ill., and White Plains, N.Y., pp. 151-155, 1953.
† Foote and Friedman, from whom this explanation is taken, use "row" where
previously in this book "line" has been used.


Row (3')(-1.681510): Multiply row (3') by -1.681510. This factor is
the element of row (3''), column (4), with its sign changed.

Row (4'): Add row (4) and the three rows following it and perform the
check.

Row (4''): Divide row (4') by its first term, that is, by 0.275173, and
perform the check. Or, multiply row (4') by 1/0.275173 = 3.634077.

This completes the forward solution.
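The forward solution is, in effect, an elimination on the symmetric matrix of moments. As a hedged sketch (ordinary Gaussian elimination, not the tabular Doolittle layout, and without the check columns), the leading terms of rows (1), (2'), (3'), and (4') can be reproduced from the adjusted augmented moments as read from Table A2.11:

```python
moments = [
    [0.763984, -0.199536, 0.360200, -0.084324],
    [-0.199536, 0.360824, 0.413180, -0.299244],
    [0.360200, 0.413180, 1.408320, 0.113360],
    [-0.084324, -0.299244, 0.113360, 1.763939],
]

def forward_pivots(matrix):
    """Return the leading (diagonal) term left in each row after elimination."""
    a = [row[:] for row in matrix]          # work on a copy
    n = len(a)
    pivots = []
    for i in range(n):
        pivots.append(a[i][i])
        for j in range(i + 1, n):
            factor = a[j][i] / a[i][i]      # multiplier applied to row i
            for k in range(i, n):
                a[j][k] -= factor * a[i][k]
    return pivots

pivots = forward_pivots(moments)
print([round(p, 6) for p in pivots])
```

Small differences in the last decimal places are to be expected, since the sketch starts from six-decimal inputs while the original computations carried nine.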

Unfortunately, the checks do not guarantee that the correct multiplicand
has been used; they only prove that the multiplications were carried
out correctly. As a final check, it is suggested that the multiplicands
shown in the stub be examined to make sure that the correct value was
used and that these then be used to recheck the computations in the $\Sigma_I$
column [column (10) in Table A2.11]. Experience in our central computing
unit has indicated that occasionally a statistical clerk is interrupted
between the computations involved in the x and the I part of the table
and that the wrong multiplicand is used in the latter set of computations.
It seems unlikely, however, that a wrong multiplicand would be used in
the x part of the table and the correct one in the I part. When the
abbreviated Doolittle solution is used, this final check is not needed, as the
computations are carried out on a column-by-column basis rather than
a row-by-row basis.

D Matrix. The D matrix is shown in Table A2.11 immediately
following the I part of the forward solution, at the foot of columns (6)-(10).
Its computation involves the terms in the last two rows of each section
in the I part of the forward solution. The element in the ith row and jth
column of the D matrix, $d_{ij}$, is obtained by the following formula:

$$d_{ij} = (1, I_i)(1', I_j) + (2', I_i)(2'', I_j) + (3', I_i)(3'', I_j) + (4', I_i)(4'', I_j)$$

where the first term within the parentheses refers to the row and the
second to the column designation of the elements in the forward solution.
Therefore:

For column $I_1$,

$$d_{11} = (1)(1.308928) + (.261178)(.846030) + (-.900630)(-2.223789) + (1.896594)(6.892370) = 16.604731$$

and for column $I_2$,

$$d_{12} = (1)(0) + (.261178)(3.239286) + (-.900630)(-4.057173) + (1.896594)(13.822748) = 30.716183$$

$$d_{22} = (1)(3.239286) + (-1.643147)(-4.057173) + (3.803647)(13.822748) = 62.482672$$

and so on for each of the items.
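The three worked elements can be recomputed from the pairs of forward-solution entries quoted in the text. In this sketch each I column is stored as a list of (unreduced-row entry, reduced-row entry) pairs, one pair per section; that storage layout is an assumption made for illustration.

```python
# One (last-but-one row, last row) pair per section of the forward solution
col_I1 = [(1.0, 1.308928), (0.261178, 0.846030),
          (-0.900630, -2.223789), (1.896594, 6.892370)]
col_I2 = [(0.0, 0.0), (1.0, 3.239286),
          (-1.643147, -4.057173), (3.803647, 13.822748)]

def d_element(col_i, col_j):
    """Sum over sections of (first entry of column i) x (second entry of column j)."""
    return sum(first_i * second_j
               for (first_i, _), (_, second_j) in zip(col_i, col_j))

d11 = d_element(col_I1, col_I1)
d12 = d_element(col_I1, col_I2)
d22 = d_element(col_I2, col_I2)

print(round(d11, 6), round(d12, 6), round(d22, 6))
```

Swapping the arguments, `d_element(col_I2, col_I1)`, gives the same value as `d12` up to rounding, reflecting the symmetry of D.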



$d_{11}$. The element in the second, or $X_2$, column equals $d_{11}d_{22}$ = (16.604731)
(62.482672) = 1037.507961, and the element in the $\Sigma$ column is obtained
by multiplying $d_{11}$ by $\Sigma d_{jj}$. $\Sigma d_{jj}$ is the sum of the diagonal elements of D
and is shown in the last column of row (10). That the sum across row (6)
is identical (except for rounding error) with the element in the $\Sigma$ column
is indicated by a check mark, or by correcting the last digit, as shown.

Row (7): Compute $d_{1j}^2$, that is, the square of each of the elements in
the first row of the D matrix, excluding the element in the $\Sigma$ column.
For example, the item in the second column of row (7) is $d_{12}^2 = (30.716183)^2$.
The element in the $\Sigma$ column of row (7) is the sum across the row. The
check on this row is one of recomputation.

Row (8): Subtract each element of row (7) from the element in the
corresponding column of row (6), including those in the $\Sigma$ column. That
the sum across row (8) is identical (except for possible rounding error)
with the element in the $\Sigma$ column is indicated by a check mark. This
checks the computation of row (8).

Row (9): Compute $1/[(N - m)d_{11}^2]$, where (N - m) equals the sample
size minus the total number of variables, and $d_{11}^2$ is the square of the
element in the first row and first column of D. The value of $d_{11}^2$ is given
in the first column of rows (6) and (7). In this example, (N - m)
equals 20 - 4 and $d_{11}^2$ equals 275.717092; therefore $1/[(N - m)d_{11}^2]$ =
1/4,411.4735 = 0.000227. Multiply each element in row (8) by 0.000227,
including that in the $\Sigma$ column. That the sum across row (9) is identical
(except for possible rounding error) with the item in the $\Sigma$ column is
indicated by a check mark, or by correcting the last digit.

Row (10): Compute the square root of the element in the corresponding
column of row (9), except the element in the $\Sigma$ column. The elements in
row (10) are the standard errors of the coefficients in the corresponding
column of row (5). The check is one of recomputation.

Where the number of variables is larger or smaller than in the examples
shown, the number of columns from (1)-(5) and (6)-(10) and the number
of rows from (1)-(4) and in the D matrix will change accordingly, but
not, of course, the rows from (5)-(11).

Coefficients of Partial Determination. The calculation of the
highest order coefficients of partial determination (the square of the
partial correlation coefficient) is shown in row (11). This is done as
follows:

Row (11): Divide each element in row (7) by the element in the
corresponding column of row (6), except the $\Sigma$ column. The elements in row
(11) are the coefficients of partial determination. The element in row (11),
column (2), for example, equals $r^2_{12.34}$. The check on this row is one of
recomputation.
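Rows (5) through (11) can be traced for the $X_2$ column with the three D elements quoted earlier. The sketch below follows the row definitions in the text; note that the text rounds the row (9) factor to 0.000227 before multiplying, so the standard error computed here with the unrounded factor differs slightly in the fourth decimal from a hand computation with the rounded one.

```python
import math

d11, d12, d22 = 16.604731, 30.716183, 62.482672   # elements of D quoted in the text
N, m = 20, 4                                       # sample size, number of variables

b_12 = -d12 / d11                      # row (5): regression coefficient
row6 = d11 * d22                       # row (6): d11 * djj
row7 = d12 ** 2                        # row (7): squared first-row element
row8 = row6 - row7                     # row (8): (6) - (7)
factor = 1 / ((N - m) * d11 ** 2)      # row (9) multiplier, about 0.000227
row9 = row8 * factor
se_b_12 = math.sqrt(row9)              # row (10): standard error of the coefficient
r2_12 = row7 / row6                    # row (11): coefficient of partial determination

print(round(b_12, 6), round(row6, 6), round(row8, 6),
      round(se_b_12, 4), round(r2_12, 6))
```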



Column (7): The deadjusted standard errors of the b's are obtained
by multiplying the $s_b$, column (3), by their respective $k_i'$.

Column (8): Enter the means of the variables from Table A2.9.

Column (9): Computations in this column are used in obtaining the
constant for the equation. Multiply the deadjusted b, column (6),
by the mean in the corresponding row of column (8) and add the figures
in column (9), or cumulate the products directly in the machine. The
constant in the equation, a, is obtained by subtracting the cumulated
product from the mean of $X_1$, the element in row (13), column (8). Hence,
a = 54.6800 - (-36.1338) = 90.8138. This result can be recorded
directly as the constant in the regression equation shown in row (18).

The final regression equation, in the following form, is shown in row (18):

$$X_1 = a + b_{12.34}X_2 + b_{13.24}X_3 + b_{14.23}X_4$$

The figures in the table within the parentheses are the standard errors
of the respective regression coefficients.

Row (17): The standard error of estimate, $\bar{S}_{1.234}$, also must be
deadjusted. This is done by dividing $\bar{S}_{1.234}$ by $k_1$. The latter is given in row
(13), column (4). In our example, $k_1$ = 0.01; therefore, for this example,
the standard error of estimate is 0.013711/0.01 = 1.3711. The indicated
computation is shown in row (17).

The coefficient of multiple determination need not be deadjusted.

The check in this section is one of recomputation.

If all of the $k_i'$ equal one, columns (6) and (7), for lines (14)-(16), can
be omitted. In this case, column (2) is used in place of column (6) in
obtaining the constant in the equation in column (9).

This completes the solution and computation of all the multiple
coefficients.
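The deadjustment step can be sketched as below. The k values (0.01 for $X_1$, $X_2$, and $X_4$, and 0.001 for $X_3$) are a reading of rows (13)-(16) of Table A2.11 and should be treated as assumptions, since that part of the table is hard to read in this copy; the adjusted b's, standard errors, and standard error of estimate are the figures quoted in the text and table.

```python
k = {"X1": 0.01, "X2": 0.01, "X3": 0.001, "X4": 0.01}    # assumed adjustment values
b_adjusted = {"X2": -1.849845, "X3": 0.851894, "X4": -0.415085}
se_adjusted = {"X2": 0.146092, "X3": 0.068725, "X4": 0.053981}
s_adjusted = 0.013711                                     # adjusted standard error of estimate

# k_i' = k_i / k_1 converts each adjusted coefficient back to original units
k_prime = {name: k[name] / k["X1"] for name in b_adjusted}

b = {name: b_adjusted[name] * k_prime[name] for name in b_adjusted}
se = {name: se_adjusted[name] * k_prime[name] for name in se_adjusted}
s_estimate = s_adjusted / k["X1"]

for name in b:
    print(name, round(b[name], 7), round(se[name], 7))
print(round(s_estimate, 4))
```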

Eliminating or Adding Variables. If one or more variables are to
be eliminated or added, the measures of correlation and regression can
be obtained without rerunning the analysis.

Eliminating Variables. Application of the formula given below,*
which applies if one variable is to be eliminated, yields elements of a
similar D matrix, $D_{.k}$, for all variables except the omitted one.

The elements of this matrix, the $d_{ij.k}$, can be obtained by the formula:

$$d_{ij.k} = \frac{d_{ij}d_{kk} - d_{ik}d_{jk}}{d_{kk}}$$

where the d's are the elements of D. These values are used in place
of the corresponding $d_{ij}$ values in the computations beginning with
row (5) of Table A2.11.

* This formula was suggested by Frederick V. Waugh.



For example, if $x_4$ were to be dropped from the previous analysis, we
would compute the first row of $D_{.4}$, that is, $d_{11.4}$, $d_{12.4}$, and $d_{13.4}$, by the
formula:

$$d_{1j.4} = \frac{d_{1j}d_{44} - d_{14}d_{j4}}{d_{44}}$$

If we consider the adjusted augmented moments of $x_1$ with the other
variables, given in the first row of Table A2.11, as $m_{1j}$, a check on the
computation of the first row of $D_{.4}$ is given by computing $m_{11}d_{11.4}$ +
$m_{12}d_{12.4}$ + $m_{13}d_{13.4}$. This sum should equal 1.

It is not necessary to compute the entire $D_{.4}$ matrix. In addition to
the first row, we need compute only the diagonal elements, that is, $d_{jj.4}$,
given by the formula:

$$d_{jj.4} = \frac{d_{jj}d_{44} - d_{j4}^2}{d_{44}}$$

The partial regression coefficients can be obtained by:

$$b_{1j} = -\frac{d_{1j.4}}{d_{11.4}}$$

Their standard errors are given by:

-N A’Vfjxj 

The coefficients of partial determination equal; 

e _ 4,0 

r _ _ . 

The coefficient of multiple determination equals: 

_ d,vt'^<hi - 3 

ruj, 

The standard error of cstima.tc is cben by; 


-1 \! W'yf 

Similaris, if both and were to be eliminated, we would cosnputc 
the elements of the (hinn. 25 foilows: 

f r/. ( 'if., s j / 

‘v-ut 

Tlius. if more than one variable is to be eliminated, the com.put.atto.ns 
must he done in steps by cUmir, sting them one at a time. 




Use of the above formula is easy if only one variable is to be eliminated.
It becomes more difficult as additional variables are dropped. Sometimes
the analyst knows fairly well in advance which variables may need to be
eliminated. If so, he should use them as the highest numbered independent
variables. Thus, in a five-variable problem, if $X_4$ and $X_5$ were to be
eliminated, this could be done by dropping columns (4), (5), (10), and (11)
and rows (4) and (5) and their corresponding sections in the forward
solution. The D matrix then could be easily recomputed and the remaining
computations carried out as in any three-variable analysis. New check
sums for use in the computations beginning with row (6) probably would
be advisable.

ADDING VARIABLES. In general, it is easier to drop variables than to
add them. Hence, as many variables as are likely to be used should be
incorporated in the initial analysis; some of these can then be dropped
if this appears advisable. At times, however, a variable will need to be
added. Assume that the added variable is $X_5$. Use can be made of all of
the computations already made in the forward solution. Columns are
added between the former columns (4) and (5) and between columns
(9) and (10), and a row (5) and a corresponding section are added in the
forward solution. Figures in these columns can be filled in by performing
the same sort of computations that were done previously. An additional
product from the new section (5) will need to be added to each of the
elements in the original D matrix, and a new column (5) and row (5)
should be added. These steps can be checked by recomputation or by
use of a new check sum. All of the coefficients should be recalculated,
making use of the new D matrix and of new check sums.
Standard Errors of the Function and of Forecasts. The standard
error of a point on the regression equation, or function, relates to a point
on the regression surface corresponding to specified values of the
independent variables (note Chapter 19).

For a four-variable multiple regression problem, the square of the
standard error of a point on the regression equation, or function, is
given by

$$s_f^{2} = \bar{S}^{2}\Big[\frac{1}{N} + N c_{22}(X_2 - M_2)^{2} + N c_{33}(X_3 - M_3)^{2} + N c_{44}(X_4 - M_4)^{2}$$
$$\qquad + 2N c_{23}(X_2 - M_2)(X_3 - M_3) + 2N c_{24}(X_2 - M_2)(X_4 - M_4) + 2N c_{34}(X_3 - M_3)(X_4 - M_4)\Big]$$

where $\bar{S}^{2}$ is the deadjusted value of the square of the adjusted standard
error of estimate, obtained by squaring the deadjusted standard error
of estimate; N is the number of observations on which the analysis is
based;* and the $c_{ij}$ are obtained from the elements of the D matrix given
in Table A2.11 by the formula:

$$c_{ij} = \frac{d_{11}\,d_{ij} - d_{1i}\,d_{1j}}{d_{11}}$$

If these values of $c_{ij}$ are substituted in the formula for the square of the
standard error of the function, N and $d_{11}$ appear in each of the products
within the brackets. [Compare with equation (19.2).] We can, therefore,
rewrite the formula as:

$$s_f^{2} = \bar{S}^{2}\Big\{\frac{1}{N} + \frac{N}{d_{11}}\Big[c'_{22}(X_2 - M_2)^{2} + c'_{33}(X_3 - M_3)^{2} + c'_{44}(X_4 - M_4)^{2}$$
$$\qquad + 2c'_{23}(X_2 - M_2)(X_3 - M_3) + 2c'_{24}(X_2 - M_2)(X_4 - M_4) + 2c'_{34}(X_3 - M_3)(X_4 - M_4)\Big]\Big\}$$

where $c'_{ij} = d_{11}d_{ij} - d_{1i}d_{1j}$. Thus $c'_{22} = d_{11}d_{22} - d_{12}^{2}$ and
$c'_{23} = d_{11}d_{23} - d_{12}d_{13}$. The computed c' are shown in Table A2.12.


Table A2.12 


tors tfti roi7R-V,Msi.\ti!.r Mfi.rifsr RroRt.otnv rfsf'nitvf 


Outline 

Values 


^tz 

5)4.0;414 -240!?', 

-P.s1.s5 


:o.?sas2 

-f>25’^s 

‘■'0 


12.??si 


$c'_{22}$, $c'_{33}$, and $c'_{44}$ were computed in row (8) of Table A2.11, columns
(2), (3), and (4), respectively. The other c' must be computed directly.
For example, $c'_{23} = d_{11}d_{23} - d_{12}d_{13}$. Substituting the values from the
D matrix, we obtain $c'_{23} = -29.0186$. The computation of the $c'_{ij}$ can be
checked by computing the following sums of products:

(a) $c'_{22}m_{22} + c'_{23}m_{23} + c'_{24}m_{24}$
(b) $c'_{23}m_{23} + c'_{33}m_{33} + c'_{34}m_{34}$
(c) $c'_{24}m_{24} + c'_{34}m_{34} + c'_{44}m_{44}$

where $m_{ij}$ is the adjusted augmented moment of $x_i$ on $x_j$ shown in rows
(1)-(4), columns (1)-(4), of Table A2.11. Each of these sums of products
should equal $d_{11}$, except for possible rounding errors. For example,
to check the first row of the $c'_{ij}$, we compute (a): (94.0241)(0.360324)
+ (-29.0186)(0.413180) + (17.8135)(-0.299249) = 16.6017. The second
and third rows are checked by computing (b) and (c), respectively. These
are adjusted terms.

* N is required in all terms after the first within the brackets because of the use
of augmented moments in the computations.

For use in the formula for the standard error of a function, the $c'_{ij}$
must be deadjusted. This is done by multiplying by $k_i k_j$, the appropriate
adjustment factors from Table A2.10. For example, to deadjust $c'_{24}$,
17.8135, multiply by $k_2 k_4$, or 0.0001. Therefore, the deadjusted value
of $c'_{24}$ = 0.00178. The nature of the formula is such, however, that $d_{11}$
is never deadjusted.
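
The computation of the c' values and their check sums can be sketched in Python as follows. This is a minimal illustration on a hypothetical three-variable moment matrix whose inverse is known exactly. It assumes that each c' is formed from the elements of D as $d_{11}d_{ij} - d_{1i}d_{1j}$. The matrix, its inverse, and the name `c_prime` are not from the original tables.

```python
from fractions import Fraction as F

# Hypothetical moment matrix m (variable 1 is the dependent one) and its exact inverse D.
m = [[2, 1, 0],
     [1, 2, 1],
     [0, 1, 2]]
D = [[F(3, 4), F(-1, 2), F(1, 4)],
     [F(-1, 2), F(1), F(-1, 2)],
     [F(1, 4), F(-1, 2), F(3, 4)]]


def c_prime(d):
    """c'_ij = d_11*d_ij - d_1i*d_1j for the independent variables (rows/columns 2..n)."""
    n = len(d)
    return [[d[0][0] * d[i][j] - d[0][i] * d[0][j] for j in range(1, n)]
            for i in range(1, n)]


cp = c_prime(D)

# Check sums: for each independent variable i, the sum over j of c'_ij * m_ij
# should reproduce d_11.
check_sums = [sum(cp[i][j] * m[i + 1][j + 1] for j in range(len(cp)))
              for i in range(len(cp))]
```

Exact fractions are used here only so that the check sums match $d_{11}$ without rounding error; with desk-calculator decimals the agreement is approximate, as the text notes.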

The means that are used in the formula are obtained from Table A2.11,
rows (13)-(16) of column (8). These are given on a deadjusted basis.

Inserting the deadjusted standard error of estimate, $c'_{ij}$, means, and the
adjusted $d_{11}$ in equation (A2.3) gives the formula for the square of the
standard error of the function.

The standard error of the function is obtained by inserting the specified
values of $X_2$, $X_3$, and $X_4$ for any given observation and taking the square
root of the result.

The standard error of a specified forecast is obtained from the following
formula:

$$s_{\text{forecast}}^{2} = s_f^{2} + \bar{S}^{2}$$

where $s_f^{2}$ is the square of the standard error of the function and $\bar{S}^{2}$ is on a deadjusted basis.
Use of an Alternative Variable as the Dependent One. All measures
of regression and correlation given in preceding sections are based on the
use of $X_1$ as the dependent variable. If, after the analysis is run, it seems
desirable to have one of the other variables, $X_i$, as the dependent one,
the various statistical measures can be obtained from the original D
matrix by use of the following.

The partial regression coefficients equal:

$$b_{ij} = -\frac{d_{ij}}{d_{ii}}$$

If, for example, $X_3$ is to be used as the dependent variable in the five-variable
problem given above, we would compute $b_{31}$, $b_{32}$, and so on,
wherein $b_{34} = -d_{34}/d_{33}$, etc.

The standard errors of the regression coefficients are given by:

$$s_{b_{ij}} = \sqrt{\frac{d_{ii}\,d_{jj} - d_{ij}^{2}}{(N - m)\,d_{ii}^{2}}}$$

For example,

$$s_{b_{34}} = \sqrt{\frac{d_{33}\,d_{44} - d_{34}^{2}}{(N - m)\,d_{33}^{2}}}$$

The coefficients of partial determination equal:

$$r_{ij}^{2} = \frac{d_{ij}^{2}}{d_{ii}\,d_{jj}}$$

For example,

$$r_{34}^{2} = \frac{d_{34}^{2}}{d_{33}\,d_{44}}$$

The coefficient of multiple determination is given by:

$$R_{i}^{2} = \frac{d_{ii}\,m_{ii} - 1}{d_{ii}\,m_{ii}}$$

For example,

$$R_{3}^{2} = \frac{d_{33}\,m_{33} - 1}{d_{33}\,m_{33}}$$

The standard error of estimate equals:

$$\bar{S}_{i} = \sqrt{\frac{1}{N(N - m)\,d_{ii}}}$$

For example,

$$\bar{S}_{3} = \sqrt{\frac{1}{N(N - m)\,d_{33}}}$$

It should be noted that when variables are eliminated, added, or
interchanged, the regression coefficients, their standard errors, and the
standard error of estimate must be readjusted before they can be applied
to the original data. All of the formulas shown apply to adjusted values.
This completes the explanation of the Friedman-Foote method of
solution.
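
How a single inverse matrix serves any choice of dependent variable can be sketched in Python as follows. This is a hedged illustration, not the book's worksheet: it uses a hypothetical three-variable moment matrix with a known exact inverse, and it assumes the usual identities for the inverse of a moment matrix, namely $b_{ij} = -d_{ij}/d_{ii}$, partial determination $d_{ij}^{2}/(d_{ii}d_{jj})$, and multiple determination $1 - 1/(m_{ii}d_{ii})$. The function name `measures_for_dependent` is invented for the example, and standard errors are omitted.

```python
from fractions import Fraction as F

# Hypothetical moment matrix and its exact inverse (not the values of Table A2.11).
m = [[2, 1, 0],
     [1, 2, 1],
     [0, 1, 2]]
D = [[F(3, 4), F(-1, 2), F(1, 4)],
     [F(-1, 2), F(1), F(-1, 2)],
     [F(1, 4), F(-1, 2), F(3, 4)]]


def measures_for_dependent(d, moments, i):
    """Regression measures with variable i (0-based index) taken as the dependent one."""
    others = [j for j in range(len(d)) if j != i]
    b = {j: -d[i][j] / d[i][i] for j in others}                      # regression coefficients
    partial_r2 = {j: d[i][j] ** 2 / (d[i][i] * d[j][j]) for j in others}
    multiple_r2 = 1 - 1 / (moments[i][i] * d[i][i])
    return b, partial_r2, multiple_r2


# Take the second variable (index 1) as dependent, without inverting anything again.
b, partial_r2, multiple_r2 = measures_for_dependent(D, m, 1)
```

For this small matrix the coefficients can be confirmed by hand from the two normal equations, which give 1/2 for each independent variable.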


Computing Residuals for Graphic Multiple Curvilinear Regressions

Where there are a large number of individual observations, the average
residual around the net regression line may be computed from group
averages, instead of calculated for each individual observation as described
in Chapter 14. This may save much time in calculating the average
residuals to obtain the first approximation regression curves.




After the net linear regression coefficients are computed, the observations
are thrown into groups with respect to the first independent factor, say
$X_2$, and averages of each factor are computed for the records falling in
each group. If there are four groups, for example, there will be four sets
of averages:

Value of X_2     Average X_1     Average X_2     Average X_3
0-9              M_{1-1}         M_{2-1}         M_{3-1}
10-19            M_{1-2}         M_{2-2}         M_{3-2}
20-29            M_{1-3}         M_{2-3}         M_{3-3}
30 and over      M_{1-4}         M_{2-4}         M_{3-4}

The average estimated value, $M'_{1-1}$, may then be calculated for each
group by substituting the means for that group in the regression equation.
Thus for the first group

$$M'_{1-1} = a + b_{12.3}\,M_{2-1} + b_{13.2}\,M_{3-1}$$

and the average residual is

$$M_{z-1} = M_{1-1} - M'_{1-1}$$

In a similar manner the average residual may be calculated from the
group averages for each of the other groups, and then plotted as a
departure from the net regression line, as illustrated in Figure 14.5. After
the computation is completed for $X_2$, the records may be reclassified
with respect to $X_3$, new means calculated for each variable for each group,
and the process continued just as for $X_2$. The same steps are carried out
for each other independent variable in turn. This method may be used
to determine the net residuals around curvilinear regressions fitted by
mathematical curves just as well as for linear regressions.
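
The short cut rests on the fact that, for a linear equation, the residual of the group averages equals the average of the individual residuals. A minimal Python sketch of the grouping step is given below; the records, the regression constants `a`, `b2`, `b3`, and the class limits are hypothetical values chosen only for illustration.

```python
from statistics import mean

# Hypothetical records (X1, X2, X3) and hypothetical net regression constants.
records = [(12.0, 3, 5), (15.0, 8, 4), (20.0, 12, 7), (18.0, 17, 3),
           (27.0, 22, 9), (25.0, 28, 6), (33.0, 31, 10), (30.0, 36, 8)]
a, b2, b3 = 4.0, 0.5, 1.0


def group_index(x2):
    """Class intervals of X2: 0-9, 10-19, 20-29, 30 and over."""
    return min(int(x2 // 10), 3)


groups = {}
for x1, x2, x3 in records:
    groups.setdefault(group_index(x2), []).append((x1, x2, x3))

avg_residual_from_means = {}
avg_residual_individual = {}
for g, rows in sorted(groups.items()):
    m1, m2, m3 = (mean(col) for col in zip(*rows))
    estimate = a + b2 * m2 + b3 * m3           # group means substituted in the equation
    avg_residual_from_means[g] = m1 - estimate
    # The longer route: residual for every record, then averaged.
    avg_residual_individual[g] = mean(x1 - (a + b2 * x2 + b3 * x3)
                                      for x1, x2, x3 in rows)
```

Both dictionaries hold the same values, so only one substitution per group is needed instead of one per record.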
Once the first set of freehand approximation curves has been drawn,
the remainder of the work has to be carried forward just as described in
Chapter 14, as the average of values along a curve does not precisely
represent that curve in the same way that the average of values along a
straight line will represent that line.

Auxiliary Graphic Processes with the Short-Cut Graphic Method

The short-cut method of determining a net curvilinear regression
described in Chapter 16 may be materially aided by using graphic methods
in transferring departures from one figure to another, and in calculating
the averages of the values as plotted.
After the original observations are plotted and the first approximation
to the regression line or curve is drawn (as in Figure 16.2), the departures
from that line must be plotted against the next variable. A procedure for
making those transfers graphically is shown in Figures A2.1 to A2.5.




Figure A2.3 shows this step just after the value for the 1920 observation
was entered on the chart. After the value is marked on the new figure,
it is crossed off on the slip, to prevent confusion. Figure A2.4 shows the
process completed, just as the last value on the slip, that for the 1933
observation, is entered. (It will be noticed that the values are transferred
in sequence from top to bottom of the slip, to prevent confusion.)

Fig All. This shows the process of scaling off the departures,
partially completed.

After the new curve is inserted on the chart, the next step is to transfer
residuals from the new curve to the next figure. The departures can be
scaled off from a curve as readily as from a line. Figure A2.5 shows the
start of the next stage of the process, after the departures for the
observations for 1920 and 1937 have been scaled off from the first approximation
curve on Figure 16.3, and just as the value for 1936 is entered. The
process is completed and carried on to the next chart (Figure 16.4)
just as illustrated above. The same process is used in transferring the
departures for each stage in the approximation process, always scaling
off the residuals from the last approximation curve, and plotting them as
departures from the last curve on the next chart, prior to drawing in the
new curve.

Fig. A2.3. Here the slip with the departures from the first approximation
is moved to the next chart, ready to start transferring the departures to get
the first approximation for the next variable. The first observation, for
1920, has been entered and checked off.

Fig. A2.5. After the first (or subsequent) approximation curves are
drawn in, the departures from the curve can be scaled off as shown.

After the departures are entered, averages of departures are sometimes
needed. In such cases, graphic means can be used to average each group
of observations. To do this, an approximate average is inserted by eye.
Then all the positive departures of the observations in that group from
the approximate average are accumulated on one slip, scaled off each in
turn as an addition to the other departures, and all the negative departures
from the approximate average are accumulated on another slip. The
difference between the two accumulations is divided by the number of
cases, giving a plus or minus correction to the approximate average.
At later stages, when average deviations from a previous line or curve are
desired, graphic accumulations can be used similarly, with the previous
line used as the first approximation to the new average.
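
The arithmetic behind this graphic averaging can be sketched in a few lines of Python. The values and the eye estimate below are hypothetical, and `corrected_average` is a name chosen for the illustration.

```python
def corrected_average(values, approximate):
    """Refine an approximate average from accumulated positive and negative departures."""
    departures = [v - approximate for v in values]
    positive = sum(d for d in departures if d > 0)     # accumulated on the first slip
    negative = sum(-d for d in departures if d < 0)    # accumulated on the second slip
    correction = (positive - negative) / len(values)   # plus or minus correction
    return approximate + correction


values = [12, 15, 9, 14]
result = corrected_average(values, approximate=12)
```

Whatever approximate average is chosen by eye, the correction brings it to the arithmetic mean of the group; a closer guess only makes the accumulations shorter.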





APPENDIX 3
Technical Notes


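Note 1 (Chapter 7). To prove for a very simple case that $r_{xy}^{2}$ measures the
proportion of variance in Y explained by X. Let a, b, c, etc., be series
with $\sigma_a = \sigma_b = \sigma_c$, and with all intercorrelations such as $r_{ab}$, $r_{ac}$, etc., = 0.

Let Y = a + b + c, and X = a + b. Then

$$r_{xy}^{2} = \frac{p_{xy}^{2}}{\sigma_x^{2}\,\sigma_y^{2}}$$

[Here the symbol $p_{xy}$ is used to represent $\Sigma(xy)/n$.]

Each (y)(x) = (a + b + c)(a + b) = $a^{2} + 2ab + ac + b^{2} + bc$.

Since $\Sigma(ab)$, $\Sigma(ac)$, $\Sigma(bc)$ = 0,

$$\Sigma(y)(x) = \Sigma a^{2} + \Sigma b^{2}$$

Similarly, each

$$(y^{2}) = (a + b + c)^{2} = a^{2} + 2ab + 2ac + b^{2} + 2bc + c^{2}$$
$$\Sigma(y^{2}) = \Sigma a^{2} + \Sigma b^{2} + \Sigma c^{2}$$

By similar proof,

$$\Sigma(x^{2}) = \Sigma a^{2} + \Sigma b^{2}$$

Hence

$$r_{xy}^{2} = \frac{(\sigma_a^{2} + \sigma_b^{2})^{2}}{(\sigma_a^{2} + \sigma_b^{2} + \sigma_c^{2})(\sigma_a^{2} + \sigma_b^{2})} = \frac{\sigma_a^{2} + \sigma_b^{2}}{\sigma_a^{2} + \sigma_b^{2} + \sigma_c^{2}} = \frac{2}{3} \quad (\text{since } \sigma_a = \sigma_b = \sigma_c)$$

Similar results will be obtained for other combinations of elements.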

Note 2 (Chapter 12). It can be proved that the coefficient of multiple
determination ($R^{2}$) measures the percentage of variance ascribable to the several independent
factors for certain simple cases. Thus assume four variables, A, B, C, D, with all
intercorrelations equal to 0, and all $\sigma$ equal. Let Y = A + B + C. Then correlate Y
with A, B, and D. The regression equation will work out

$$Y = a + A + B + 0(D)$$

Computing $R^{2}$ by equation (12.3),

$$R^{2} = \frac{b_{ya}\Sigma(ya) + b_{yb}\Sigma(yb) + b_{yd}\Sigma(yd)}{\Sigma(y^{2})}$$

Each

$$(y)(a) = (a + b + c)(a) = a^{2} + ab + ac$$
$$\Sigma(ya) = \Sigma a^{2} + \Sigma ab + \Sigma ac = \Sigma a^{2} \quad (\text{since } r_{ab} = r_{ac} = 0)$$

Similarly

$$\Sigma(yb) = \Sigma b^{2}, \qquad \Sigma(yd) = 0$$

And each

$$(y^{2}) = (a + b + c)^{2} = a^{2} + 2ab + b^{2} + 2ac + 2bc + c^{2}$$

and

$$\Sigma(y^{2}) = \Sigma a^{2} + \Sigma b^{2} + \Sigma c^{2}$$

Hence

$$R^{2} = \frac{(1)(\Sigma a^{2}) + (1)(\Sigma b^{2}) + (0)\Sigma(yd)}{\Sigma a^{2} + \Sigma b^{2} + \Sigma c^{2}}$$

And since all $\sigma$'s are identical,

$$R^{2} = \frac{2}{3}$$

In this case, then, when Y, composed of three equally variable non-correlated elements,
is correlated with two of those elements, and with one other equal element which is not
represented in Y and which is not correlated with elements present in Y, the multiple
determination of Y by the two elements (A and B) is found to be 2/3.
Similar results will be secured for other experimental cases which may be set up.
Note 3 (Chapter 12). Coefficients of partial correlation are usually defined by
the formula

$$r_{12.3} = \frac{r_{12} - r_{13}\,r_{23}}{\sqrt{1 - r_{13}^{2}}\,\sqrt{1 - r_{23}^{2}}}$$

For coefficients with more variables eliminated, such as $r_{12.34}$, for example, this
becomes

$$r_{12.34} = \frac{r_{12.3} - r_{14.3}\,r_{24.3}}{\sqrt{(1 - r_{14.3}^{2})(1 - r_{24.3}^{2})}}$$

To determine the coefficients with several factors held constant by this method
involves a lengthy process of elimination, variable by variable, and for that reason
the method presented in the text is preferred as shorter, simpler, and more readily
subject to checking.
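
The variable-by-variable elimination described in this note can be sketched as a short recursive function in Python. It is an illustration only: it assumes the usual first-order formula applied repeatedly, holding one more variable constant at each step, and the correlation matrix shown is hypothetical (all simple correlations equal to 0.5).

```python
from math import sqrt


def partial_corr(r, i, j, held=()):
    """Partial correlation of variables i and j (0-based), holding `held` constant."""
    if not held:
        return r[i][j]                      # simple (zero-order) correlation
    *rest, k = held                         # eliminate variable k last
    rest = tuple(rest)
    r_ij = partial_corr(r, i, j, rest)
    r_ik = partial_corr(r, i, k, rest)
    r_jk = partial_corr(r, j, k, rest)
    return (r_ij - r_ik * r_jk) / sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))


# Hypothetical correlation matrix for four variables.
r = [[1.0, 0.5, 0.5, 0.5],
     [0.5, 1.0, 0.5, 0.5],
     [0.5, 0.5, 1.0, 0.5],
     [0.5, 0.5, 0.5, 1.0]]

r_12_3 = partial_corr(r, 0, 1, held=(2,))       # one variable held constant
r_12_34 = partial_corr(r, 0, 1, held=(2, 3))    # two variables held constant
```

Each added variable triples the number of lower-order coefficients needed, which shows why the note calls the process lengthy for hand computation.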

Note 4 (Chapter 17). Reliability of observed correlations. Figures 17.2 to 17.5
provide a ready means of judging the probable minimum value for the correlation in
the universe, with any observed value and any given size of sample. The chart is entered
with the observed correlation as abscissa; the ordinate for the intersection of that
abscissa with the curve for the given size of sample gives the probable minimum
correlation. Thus if the coefficient of simple correlation, r = 0.65, is obtained from a
sample of 22 cases, the researcher will know from Figure 17.2 that, if he makes the
statement that the true correlation in the universe is at least 0.38, he will be wrong in
only 5 percent of such statements, on the average. Figure 17.3 applies to $R_{1.23}$, Figure
17.4 to $R_{1.2345}$, and Figure 17.5 to $R_{1.234567}$. Values for 2, 4, and 6 independent variables




... can be fitted separately by the method of least squares. These equations constitute
the "reduced form" of the model and are called "reduced form equations" to distinguish
them from the structural equations.

The coefficients of the structural equations can be derived from those of the reduced
forms by, in essence, reversing the process of derivation shown above. Thus, the
coefficients of Y in the two equations differ only by the factor "$b_2$" in one of the
numerators; the denominators are identical. We can therefore estimate $b_2$ as follows:



Similarly, the coefficients of Z differ only by the factor "$b_1$", and $b_1$ can be estimated as



Knowing these two values, $b_1$ and $b_2$, we can estimate $c_1$ as

and $c_2$ as

The values of $a_1$ and $a_2$ can be derived by multiplying the numerical values of
$(a_2 - a_1)/(b_1 - b_2)$ and $(a_2 b_1 - a_1 b_2)/(b_1 - b_2)$ each by $(b_1 - b_2)$; then, calling the
resulting values $k_1$ and $k_2$, we have two simultaneous equations in the two unknowns,
$a_1$ and $a_2$:

$$a_2 - a_1 = k_1$$
$$b_1 a_2 - b_2 a_1 = k_2$$

Multiplying the first equation through by $b_2$ and subtracting it from the second equation,
we have

$$a_2(b_1 - b_2) = k_2 - b_2 k_1$$

and

$$a_2 = \frac{k_2 - b_2 k_1}{b_1 - b_2}$$

where every element to the right of the equality sign is a known number. Substituting
the numerical value of $a_2$ in the first equation, we obtain

$$a_1 = a_2 - k_1$$

Our estimates of the coefficients of the structural equations are now complete.
Note 7 (Appendix 2). In computing means from rounded-off data, errors due to
rounding tend to compensate for each other, so that in large samples the means may be
carried out to more significant digits (or decimal places) than are given in the individual
items from which they are computed. In division and multiplication, however, this is
not true, so that in general the product or quotient will not be accurate to as many
decimal places as the numbers from which it was calculated. These principles can be
readily demonstrated by arithmetic examples or mathematical models. For these
reasons, slight differences in values of various statistics may result merely from
differences in how many decimal places they are carried out to in the computations. Where
the computations are carried out to four to five significant digits, the differences
generally tend to be insignificant compared to the standard errors of the statistics to
which they relate. While it is customary to carry elaborate or long series of calculations
(such as those shown on pages 507 to 51?) out to six to ten decimal places, to minimize
errors due solely to rounding, the final results are not usually shown to more than two
or three, to avoid giving a spurious impression of exact accuracy.

Similarly, in working with squares and square roots, the squares must be carried out
to twice as many decimal places as the square roots, to maintain the same degree of
accuracy in the computations.

Note 8 (Chapter 17). Beta coefficients, like coefficients of correlation, can be
strongly influenced by purposeful selection of the sample values of one or more of the
independent variables. This similarity is suggested by the fact that the two are
identical in the case of simple linear regression. If beta coefficients are calculated from a
random sample drawn from a "bivariate" or "multivariate normal population," they
qualify as estimates of corresponding parameters of this population. But if values of one
or more of the independent variables are fixed by the investigator, as in a controlled
experiment, the beta coefficients have sampling significance only in relation to a very
restricted universe in which, roughly speaking, the standard deviation of each
independent variable is held perfectly constant for all samples.
Ordinary simple and net regression coefficients are not subject to this limitation.
Thus beta coefficients are of doubtful value except in the so-called correlation model,
giving random sampling from a normally distributed "natural universe."



Author Index 


Aandahl, Andrew R., 467 
Aitchison, J., 186 
Albert, \V. \V., 460 
Allan, D. H. W., 185 
Allen, R. H., 466 
Allport, Gordon W., 465
Anderson, R. L., 338, 347
Armore, Sidney J., 324, 463
Armstrong, David T., 460
Attridge, R. F., 185

Bandeen, Robert A., 466
Barnicoat, C. R., 460 
Bartlett, M. S., 334, 347 
Basu, D., 467 
Baum, E. L,, 466 

Bean, L. H., inventor of short-cut graphic 
method, 254 

leader in correlation use, 435 
on cotton supply-demand relations, 
462 

on farmers’ response to price, 463 
on freehand versus algebraic curves, 
108 

on graphic short-cut method, 277, 278
on potato and cotton supply and demand, 463

on predicting elections, 465
on voting patterns, 465
Bean method, basis for, 169 


Beckman, F. S., 186 
Been, Richar.I O . on c ' rr'l’jpk- 

rerrc'S'on resutf--, 32 ‘ 

Bell, G. D. H., 460 
Benner, Claude t,., 164 
Black, John D., input-<’>nf,j! -nri;, 45'> 
inspircr of research, 434 
on graph.r method, 108. 273 277 
study of rrcr.mcri o'-pan:r.>'rn 4'd 
Blackmorc, John. 466 
Bo!‘'s, J X . ‘-'u 
Breimytr, Ha'^old F.. 46'> 

BrounofT, P I., t59 
Brovn. C 1., 401 
Brov.m, J. A. C , 1S5, 186 
Brov n, Jean, 466 

Brown, W'ilh.aw G 357, 377, pv?. 
40a 

Bruce, DonslJ, 377, 151 
Burm'i-ter. Cu'tave 45'> 

Btiriis, t>isar L , 324 
Butler. O. D . 460 

Calrada, Jo"''. 450 
Cart'-.-richt, T. C . 400 
C.'>s.id\ , R. B , 460 
Castle, Ehraleth J . 460 
Castle, M. E., U-d 


Cavin, jamcr, -’ 
Chamberlin Edwar !, t 




Chamber*, C,, 442 
ChatfeM. Charlotte, 460 
Chauncey, Martin R , 465 
Cherniak, Nathan, 446, 462 
Clark, Cohn, 467 
Clough, Malcolm, 463 
Cochran, D , 343 
Cochran, W 0,1 
Cochrane, \\ jllard W , 467 
Coleman, D A > 460 
Colhns, DaMil N , 466 
Court, Andrew, 350, 377, 435 
Cover Sylvia 460 
Coaan Donald R G . 465 
Coaden, Dudley J , 81 90 
Co* Gertrude M . 1 
Croxton, Frederick E . 81 90 
Crum, W L , 465 
Cume, Lauehlin 455 

Dean, Joel, 464 
de Baca, R C . 460 
Derk«n.J D D , 464 
Dtedjen*, V A , 464 
Dixon H D 460 
Doolittle, M f( . 506 
DousUf, Raul l{ 130. 464 
Duesenbcrry Jame*. 139 465 
Dwyer, P S , 507 

Efliotc, Foster F , 463 
Erdman 11 E 446,462 
E»ing, E C 460 
Ewing. E C , Jr 460 
Ezekiel, Mordecai a check on a regres- 
sion, 466 

dairy farm studies .Minnesota, 459, 
Montana. 459, Pennsylvania, 461. 
Wisconsin. 459 
input-output study. 459 
on curvilinear regression, 117 
on error of graphic net regressions 253, 
305 

on error of net regression coefliaents, 
286 

on freehand vs mathematical curves, 
108 

on gains from efficiency, 464 
on graphic method for multiple curvi- 
linear regressions, 248, 277, 278 



Ezekiel, Mordecai. on hog prices. 462 
on joint function, 377 
on lamb pncts, 462 
on meaning of correlation coefficients 
ISO 

on multiple correlation method and as 
sumptions, 187 

on need for simultaneous equations 
system. 427 

on outlook forecasting, 466 
on savings, consumption and invest- 
ment, 455. 465 

on statistical analysis and U«s of 
price, 465 

on steel cost curve, 256, 464 
on Virginia toboceo farm organization, 
461 

FAO (Food and Agriculture Organiza- 
lion). 447, 463 
Fellows H C,460 
Fetger \k\t\h F , 4W 
Fisher, Malcolm R , 186 
Fisher, R A , on "c” constants, 500 
on computation methods, 507 
on mrrelaiion significance, 293 
on fitting time senes, 90 
onumplmg distribution of R,293, 295, 
305. 533 

on use of itaUsiics, 1 
study of wheat yields, 436, 459 
treating tune as independent variable, 
213 

t transformation, 193 
Folse, J A , 461 

Foote, Richard Jay, on Bean method, 
169, 278 

on coffee consumption, 467 
on corn and feedstoffs prices, 449. 463 
on matrix computation method, 187, 
413, 507 

on simultaneous equation method, 
433 

text on agricultural prices, 451, 464 
Fox, Karl, chart from, 345 
on consumption studies, 446 
on demand for farm products, 462 
on price forecasting, 462 
on simultaneous equation systems, 
424, 425 




Fox, Karl, on supply-demand interactions, 449, 462
textbook by, 451, 462
Foytik, Jerry, 462
Freeman, Frank S., 465
Friedman, Joan, 187, 413, 433, 507

Friend. Irwin, 467 
Fulcher, Gordon, 467 

Gabriel, Harry G., 464 
Cans, A. R-, 463 
Garrison, K., 465 
Gauss, 507 
Gelfand, L. L., 466 
Geological Survey, 461 
George, James P., 466 _ 

Girschick, Meyer A., 278, 463, 50/ 
Goldbergcr, A. S., 425 
Gollnick, Heinz, 463 
Goodenough, F. L., 465 
Gorcaux, Louis, 113, 463 
Gosnell, Harold, 465 
Gowen, John W., 438. 460 
Grafius, J- E., 460 
Griliches, Zvi, 467 
Gunn, Grace, 464 
Guthrie, Edward E., 461 

Haas, George C., 442, 461, 462 
Haavelmo, Tr>-gve, 413 
Hald, Anders, 333 
Hall, Francisco dos Santos, 46/ 

Hanau, Arthur, 445, 462 
Hansel, William, 460 
Hansen, Alvin, 435, 455, 465 
Hanson, R. G., 460 
Harbeck, G. E., 461 
Harbcrgcr, Arnold, 467 
Hardenburg, E. V., 459 
H.irdin, Lowell S., 461 
Hardison, C. H., 461 
Hart, B. L, 341 
Heady, Earl O., on crop 

to fertilizers, 333, 365-6/. 3//, 39/. 

404. 408. 409 

on crop routions and ferinizer u«e, 466 

on economic analysis of fert.hzcr use. 

466 

on farm organization, 461 
Hcald, R* •. 




Hcddcn. V.'. P . 4V,. 
Hdmli/fgtr, Pc'ir. 4'-7 
Higbi-, Ed,ar Crt^'ri t* 
Hildreth. Giffr'd, 

Hok, Erlin,-. 166 
Holme^. ,4. D . IS') 
Ho/’d. Will. ■'ns C , 4'1 
H'cs Fi-in';. -••16 1'2. 
Hopp, l!‘.‘nr\ . 467 
Ho^t/'ms in, W H , .3"3 
Hotelling, liaro! !, 2;s 
HotithakKr. H ts.lrS 


' , •■,61 


I?6 

461 


Hull, Clark L , 

46' 

Hut'on. J. n . 

461 

Hu'.lex. Juh.-tn 

2 

Ives, J. Ru''',.!!. 278 

j,arrett, F O , 

4.13, 46» 

Jehlik, Paul J 

, 120 

Jensen, Etner, 

466 

John*^n, (jlcTin L . 

john'on. She: 


Jones. G T .. 

407 

jonl«.rt. n. 5 

1 . -UO 

Jureen. i-ar' , 

3 57, (27. 

Kant'jr, Harr 

',.162 

Kcaslt'Trv. ] 

E , 464 

Kendall, Mannce G 


••tw, ■* 


4. '6' 


on fi''rpU‘i'’n 


it. • ^ - 

on rank co.-Td.iti'sn i6 
on Ksnspling the' r-,. 21. 4^7 
on standard 'Tro' o: r' , r 
ficicnt, 303 

on str.itified san r •- 
Kevnes John Ma,nr.rd. I'-’’- 

Ki!!ough.HuzhiM02.!62 

Klas=cn. L H , I'-i _ 
Klcm. L.v.'. rcr.ee R , Ka. .. 
Kohhtr. M A., 4t 1 
Koon, R. M- 4« 

Koopman. J . 

Koopm.-ins, T)'’!c 
meth'x! la.'' 
on rcgrc' • r 
328. 347, -tli 
nsuk-tn'oe t',,; 


C . on 


on 


464,: 11 

-I •'tr'r 

I 


433 





Koopman*. Trailing C., on tanW frright 
ra\w and constnjction. 4M 
K««J. Ibnt, 467 
Kriitjanvin. L. Durlunk, 467 
Kurt, W. J , 464 
K>Ie. Corinne, 460 

Langbcin, U D , 461 
Laurie. E J . 186 
Legalc*. J E , 460 
LinsJe>. R K.461 
Lupton. Marj. 460 

Malenbaurn. Wilfred, lOS, 273. 277 

Marschak, Jacob. 433 

Mason. I L . 461 

Maunder, A !1 . 466 

McAUxindet. Robert, 466 

McCorkle. Chester 0 , 467 

McNall. P E . 459 

Mehren, George L , 446 462 

Memken, Kenneth W' , 463 

Mensenkarnp L E . 465 

Mettdorf, F !Uns>jGes^n, 463 

Mighell, R L,466 

Mills. Frederick C . 90. 117, 253 

Misner, E G . 212. 436, 459 

Mitchell. \Vesle> C. 435. 459 

Mood, A M . 330 

Moore, Henry L. 435 436.459,462 

Momson, F B , 459 

Muir, James C , 135 

Murray, K A H , 463 

Myers, R M , 460 

Nedeibndsch Economisch Instnut, 464, 
466 

Nerlove, Marc, 449, 466, 467 

O'Onen, Ruth, 46J I 

Omitt, C H . 343 , 

Ottoson, Howard W , 467 

Panse, V. G , 3 
Patton, Palmer, 459 
Pearson, Karl, 434, 458 
Pcsek. John T.. 333, 365, 377, 39T, 408. 
409 

Pettit, Edison, 461 
Pignaiosa, F-, 463 


Pond, George, 459, 461 
Prais, S J , tSS, 186 

Quarles. D A. Jr, 185. 186 
Quam'ngton, Bruce, 466 

Radice, E A , 465 
Raeburn, John R , 375 
Rauchenstine, E , 446, 462 
I Ray. D F . 461 
Relneke. L H , 461 
Rwhe) . Frederick D , 461 
Riley, Ralph. 460 
Robertson, Abn, 461 
Rojko, Anthony S , 467 
Roos. C F . 464 
Ross, n. A . 463 

[ Samuelson, Paul A , 465 
I Sanderson, Fred M , 436, 459 
! Schocnfeld, William A , 462 
S^ulu, Henry, on error of a forecast 
from a curve, 290. 305, 322, 323, 334 
on measurement of demand, 451, 4« 
Schumacker, Francis, 461, 467 
Seltrer, R E , 467 
Shear. S \V , 446, 462 
Shepherd, Geoffrey S , 278, 451, 464 
Shollenberger, J M , 460 
Shrader \V D , 466 
Siegel, Sidney, 466 
Smith. Allan N , 460 
Smith, Bradford D , contribution to cor- 
rebiton methods, 435 
on adjustment of production to 
demand, 462 

on forecasting cotton acreage, 463 
on interest rates and business, 465 
on vteather and cotton yield, 459 
Smith. G E r, 135 
Snedecor, George \V , 187, 280, 397, 399 
Spearman, C , 457, 465, 466 
Spencer, Gordon, 185 
Spillman, William J , 140 
Staehle, Hans, 453, 467 
Stephan. Fredench C, 102 
Stephenson, Charles A , 467 
Stephenson, William, 466 
Stevens. Chester D , 459 
Stone, J R. N , 342, 446, 463 



Student. 22 

Sukhatmc, P. V., 3, b 
Sutherland, H. E. G., 465 

Sz-irf, A., 463 

Tapp, 

Taylor, C. C., 461 
Taylor, Henry C., 435 _ 

Temporary National Economic Com- 
mittee, 464 

Thomas. Brynmor. 460 
Thomsen. Frederick Lundy. 464 
Thurstone, L. L., 457, 466 
Tinljergen, J., 435, 465 
Tocker, K* 

Tolley, Howard R., 187, 435, 459 
Trclogan, Henry C., 451, 462, 464 
Tretsen, J. 0., 459 

Ventura, M. Matcus, 466 
Vernon, J. J., 459, 461 
Viton, A., 463 
von Sacliski, Victor, 464 

Waite, Warren C., on graphic curvilinear 
correlation, 2/7, 278 

text on price studies, 451, 462. 464 
Wakeley, Ray E., 120 
Wald, Abraham, 314 _ 

Wallace, Heno' A., 187, 435, 459 
Wang, Hsl, 460 . 

Waugh, Frederick V., on computation. 

511 . 

on fitting joint functions, 4" 
on potato yield and weather, 3/2, 436, 

459 




V .ir-*: 




W.augh. Frcdetkk V.. >'n i\-.\ 
apple priori. 375 

on fjuality .md vcpc'.ah’e 4 

on validity of nitiltip*'* - -jn 
pictorial chart by, 452 
pionC'T in U'^e of coTeiiticn, 4^3 
5 tudy of N.J. fv-oto rrir", 4? 2 
Weeks, Angelina i... 465 
Wellman, H. R.. 27S 
Wells, 0. V., ‘lal 
Westbrook, E. C., 4fA 
Whitcomb. V.’. D.. 46' 

Widemand, Harley, 466 
Wilson. C. F., 450 
Winch, W. H., 465 
Witmer, Helen Leb.nd. 465 
Wold, Herman, on tecair-iv- r 
432 

on time ‘cric'', 32.’', 441 
text on demand. 355, 3'n. 42i, 446 
451, 463 

Wolfe. T. K., 4C0 
Woollam. G. E.. 463 
Working. E. j., 27? 

Working. Holbro-k. on error of r-er’ •- 
^ion line, 

on p-r.tato price diiTc.'cnti.ab. 464 
on pot.ato prir-', 459, 46r 
pioncir in price rn.aly-’.r. 435 
Welle, Kathran H„ 256, 465 

Ych, Martin H . 467 
Yule, G. Udny. on nultip’’ c-rrc..(H: 
rr.cthtKls, 4,35, 55S 

on stratified Kimp’e bt . , 

on t. .me 'erics ccrrelit’o'i'. -'-r. • *' 



Subject Index 


Introductory Note. Subjects are classified when they relate to regression
and correlation issues, or when they have been used as examples
in exercises. The many subjects of individual studies mentioned in
Chapter 25 are not indexed, however.


Abscissas, defined, 11 
Accuracy of estimate, for multiple curvilinear regressions, 249
for multiple linear regressions, 188
Accuracy of observations, effect on
results, 311

Adjustments for number of observations 
and constants, 300, 302 
Agricultural economics, studies in, 441 
Algebraic fitting, for joint function, 363 
for multiple regression curves, 205 
for simple regression curves, 83 
Analysis of variance, 388
basic principles of, 396 
defined, 396 
summary of, 411 

Appearance and productivity, studies of, 
438 

Apple prices, illustrating joint function, 
375 

Arithmetic average, defined, 2 
Assumptions for graphic multiple regression curves, 215
Astronomy, studies of, 440 


Auto sales, of European cars, -if'e 
Auto-stopring, ured as example, 35, 100 
Autocorrelttion. adjustrneri- !< r. .'M 
significance of, 337 
Average, .'j-ithntctic, sf' .'Irith.'rctic 
average 

variance, defined, 3^7 

Beef consumptio.'!, r' es.arnplc. 

Beta coefiintnt. defined, 
for multiple rcr.''e^';ro-, 192, !96 
.'ampiing '^sgnificar.rc t fi.’S 
Bias in .'P.mpling. 26 
Biorr.e'.nka, 434 

Card tabulators, in simple regression, 13;
Causation, in social sciences, 2
Check sum, defined,
use in computations,
Chemical characteristics, studies of, 4.
Children's clothing ‘tej 4a4
Coefficient of correlation, see Correlation coefficient
Coefficient of determination, 130




Coefficient of multiple correlation, 190
computation by Doolittle method, 496
computation by Friedman-Foote method, 519
Coefficient of multiple determination, 191
Coefficient of partial correlation, 192
Coefficient of regression, 131
Commodity prices, studies of, 419
Computation methods, for multiple regression, by Doolittle method, 489
by Friedman-Foote method, 507
for systems of simultaneous equations, 413
Computers, electronic, see Electronic computers
Conditioned parabola, 98
Conditions on curves, corn yield problem, 213
steel cost problem, 257
Confidence intervals, for estimated values, 65
for individual forecast, 320
for means, 21
Constant, defined, 32
Consumption, related to price, 446
Coordinates, defined, 11
Corn, response to fertilizer, 402
example of joint correlation, 409
Corn yield, example for multiple curvilinear regression, 111
example of sampling, 3
example of time series, 330
Correlation, affected by sample selection, 310
measurement of importance of relation, 126
Correlation coefficient, adjustment for n and m, 301
defined, 127
equation for, 127
sampling significance, 399
standard error of, 293
Correlation index, multiple correlation, 251
partial correlation, 252
simple correlation, 123, 192, 198
Correlation model, defined, 280
relation to selection of sample, 310
Correlation ratio, 378



Cotton irrigation, as correlation example, 135
sampling significance, 269
Covariance, defined, 137
effects on variance analysis, 411
Cows and farm income, example, 157
Cross-classification, 388
as multiple regression method, 389
for many variables, 394
for three variables, 390
Cubic parabola, characteristics of, 72
method of fitting, 87
Curvilinear function, determination of, 69
fitted by freehand curve, 103
Curvilinear regression, 69, 140
interpretation of, 144
testing by variance analysis, 405

Degrees of freedom, adjustment for, 124, 300
Dependent variable, defined, 47
in multiple regression, 171
Determination, coefficient of, 130
index of, 131
proof of meaning, 531
Deviation, mean, 7
standard, 8
Discontinuity variable, 343
Distance to fall, illustration, 33
Distance to stop of auto, example, 39
Doolittle method, 489, 493
Dot chart, method of constructing, 34
Drift line, defined, 259

Econometrica, 435
Education, studies in, 456
Egg quality, as qualitative variable, 380
Electronic computers, for multiple curvilinear regressions, 247
for multiple linear regressions, 165
Endogenous variables, 418
Equation of a function, 37
limitations to, 101
types of, 70
Equilateral hyperbola, 74
Error, standard, see Standard error
Errors, due to rounding c4I, 534
in both variables, 313
in dependent variable, 312





Errors, in independent variable, 313 
Estimating water flow, example, 57 
European auto sales, example of re- 
search, 468 

Exogenous variable, 418 
Extrapolation of regression equation, 
reliability of, 322 

Falling distance, illustrative problem, 33
Family consumption, as example of log- 
arithmic equation, 111 
Farm income, acres, and cows, as ex- 
ample, 157 

Farm organization, studies of, 443 
Farm value, as multiple linear regression 
example, 152 
studies of, 441 
F-ratio, defined, 397 
table of its significance, 398 
First differences, effect of using, 340 
Fitted curve, interpretation of, 108 
Fitting linear multiple regressions, by 
electronic computers, 185 
by linear equation, 170 
for any number of variables, 181 
for three independent variables, 177 
practical methods for, 199 
Fitting logarithmic equation, 111
Fitting multiple curvilinear regressions, 
204 

by electronic computers, 247 
by mathematical equations, 205 
by short-cut graphic methods, 254 
by successive approximations, 210 
Fitting simple parabola, 83 
Fitting systems of simultaneous equa- 
tions, 413 

Food consumption data, as example, 111
Forecasting, practical comments on, 344 
Freehand curve, 103 
Freehand fitting, cautions in, 106 
Frequency distribution, illustrated, 11 
Frequency table, defined, 5
Function, 32, 36 

curvilinear, see Curvilinear function 
Functional relation, defined, 36, 43 
determining, 39 

Graphic curve, fitted freehand, 103 
Graphic interpolation, 86 


Graphic method, for curvilinear regression, 254
for joint function, 3i'3
Graphic regression curves, rri'cl '.Kt',
Graphic representation, of two variables, 34
Gravity, illustrating quadratic equation, 75
used as illustration, 37
Gross regression coefficient, 154
Group averages, reliability of, 51

Hay stack volume, as illustration, jv.t
Hyperbola, characteristics of, 73, 74
method of fitting, 9S
Hypothesis, developing for auto sales, JO

Independent variable, defined, 47
Independent variables, in multiple regression, 171
Index of determination, 131
Index of multiple correlation, 251, 271
for joint function, 366
Index of multiple determination, I'-O
for multiple curvilinear regressions, 251
Index of net regression, 252
Individual forecast, reliability of, 318
Industrial applications, studies in, 440, 453
Input-output relations, studies of, 4 j;
International trade, studies in, AS',
Interquartile range, defined, 4
Interpreting, a fitted curve, 108
a joint function, iO
the multiple regression equation, bd
Irrigation problem, as example of time series, 329

Joint function, defined, 352
fitted algebraically, 363
fitted graphically, 35';
for 3 variables, 373
for two variables, 352
for three or more variables, 3 ■•''i
ind!r,tU"rr in rt cv.X tr.r;.: >J. zOi
Joint regression surfaces, 3:'-:
Just-identified model, -'20




Linear equation, 56
computation of, 135
determined by semi-averages, 66
fitting by least squares, 61
interpreting, 63
usefulness of, 67
Linear function, graph of, 57
Linear multiple regression, defined, 171
method of fitting, 170
Linear regression equation, defined, 134
interpretation of, 138
Logarithmic curve, income elasticity from, 111
method of fitting, 90
Logarithmic functions, characteristics of, 73
Logical limitations, in freehand fitting, 107
Logical significance of mathematical functions, 74

Macroeconomic relations, studies of, 455
Manufacturing and migration, as example, 119
Marble falling, as illustration, 33
Marketing units, studies of, 443
Mathematical equation, in economic problem, 110
when to use, 109
Mean, arithmetic, 3
standard error of, 19
Mean deviation, 7
Median, defined, 4
Migration and manufacturing, as example, 119
Model, defined, 39
for regression analysis, auto sales example, 471
for simultaneous equations system, 417
two-variable, 39
see also Correlation model and Regression model
Mortality rates, illustrating joint function, 3S0
Multiple correlation coefficient, 190, 192
reliability of, 295
sampling fluctuation illustrated, 299
Multiple curvilinear regressions, 204
fitted by successive approximations, 210



Multiple curvilinear regressions, interpretation of, 234
limitations on use of results, 240
reliability and use of, 245
short-cut graphic method, 254
standard error of estimate for, 249
use of, 245
Multiple linear regressions, 151
by successive elimination, 151
nomenclature for, 176
practical methods for fitting, 199
Multiple regression equation, 176
defined, 153
for three independent variables, 177
for two independent variables, 171
interpretation of, 183
reliability of forecast from, 320

Natural sciences, 1
Net regression, 176
Net regression coefficient, defined, 153
interpretation of, 181
standard error of, 283
Normal distribution, defined, 9
illustrated, 10
Normal equations, for linear equation, 62
for two independent variables, 171

Observation equations, 59
Ordinates, defined, 11
Orthogonal polynomials, 90
Overidentified models, 428

Parabola, characteristics of, 71
method of fitting, 83
Parameters, defined, 11, 12
Partial correlation coefficient, 192
computation by Doolittle method, 503
by Foote-Friedman method, 518
Partial regression coefficient, 176
see also Net regression coefficient
Political science, studies in, 456
Pork supply and demand, as illustration, 420
Potato yield, example of joint function, 372
Predetermined variable, 418
Price margins, studies of, 450
Prices, of farm products, studies of, 444
of industrial products, studies of, 453




Prices, of separate lots, affected by quality, 4S0
Probability, in sampling, 28
Product moment, defined, 137 
Production, affected by price, studies of, 
447 

Production function, in agriculture, 
studies of, 437 
in industry, studies of, 453 
Productivity, related to appearance, 
studies of, 438 

Projectile, illustrating parabolic equa- 
tion, 77 

Protein in wheat, used as example, 81
Psychology, studies in, 456 

Quadratic equation, characteristics of, 75 
Qualitative factor, used as independent 
variable, 378 

in multiple correlation, 380 
in simple correlation, 378 
Quartiles, defined, 4

Random selection, 2 
Randomness, in time-series residuals, 333 
Recursive system, defined, 472 
Reduced-form equation, defined, 419 
Regression analysis, meaning of, 43 
Regression coefficient, 134 
effect of sample selection, 306 
net, defined, 153 
Regression curve, defined, 134 
standard error of, 290 
Regression equation, 134 
for two independent variables, 171 
linear, defined, 134 

multiple, see Multiple regression equa- 
tion 

Regression line, defined, 134 
error of individual forecast from, 319 
Regression model, 280 
relation to sample selection, 310 
Regression relationships, changes in time, 343

Regression results, tests of, 458 
Reliability, of correlation coefficients, 
293 

of individual forecast, 318 
of multiple correlation coefficients, 295
of sample, 18 


Repeated oi. 

Rounding dim. 2 
effects of, 534

Sales quotas, studies of, 43?
Sample, defined, 11
effects of ti, 50 ',
reliability of, 18
size needed for given reliability, 27
spot, 15
stratified, 15
Sampling, assumptions in, 15
bias, 26
significance, of correlation index, 205
Sampling fluctuation, in multiple correlation coefficients, 205
Scatter diagram, defined, 55
Selection of sample, conclusions on, 316
Semi-averages, fitting line by, f>'i
Serial correlation, defined, ’37
Short-cut graphic method, 251
graphic aid for, 52^'
Simultaneous equations method, 41'
cautions on use of, 4’1
Social sciences, value of statistics for, 1
Spot sample, 15

Standard deviation, calculation from frequencies, 12
defined, 8
Standard error, interpretation of, 25
of correlation coefficient, 25,1
of graphic regression curve, 2^0
of the mean, 19
of regression coefficient, 231
of regression curves fitted mathematically, 295
of regression line, 257
of the standard deviation, 2'
use of, 24
Standard error of estimate, U'
adjustment for degrees of freedom, 12\ 591
equation for, I ? 1
for curvilinear regression, 12’
for joint function,
for multiple curvilinear regression, 249, 27''
for multiple linear regression, l''i




Standard error of individual forecast, 318-321
Standard errors for time series, 325
Standard errors of regression coefficients, computation by Doolittle method, 499
computation by Foote-Friedman method, 517
Stating conclusions, from joint function, 369
from multiple curvilinear regressions, 234
from regression and correlation analysis, 476
Statistics, defined, 11, 13
Steel cost, example for short-cut method, 255
example for testing autocorrelation, 339
Stratified sample, 15
Stream flow, example for linear regression, 57
Student's distribution, 281
Successive elimination, method of, 167
Supply-demand, income-price interactions, studies of, 449
Systems of simultaneous equations, 413

t-ratio, defined, 396
Time series, autocorrelation in, 334
error formula for, 325
method of fitting lines or curves, 90
studies in, 441
Types of equation, logical basis for, 473

Universe, defined, 14
for correlation model, 280



Universe, for regression model, 2W
Universes, past and present, 29
Utility rates, studies of, 454

Variability, measuring, 1
Variables, defined, 32
relations between, 32, 34
units in which stated, 472
Variance, ‘a’, defined, 397
the, defined, 10
Variance analysis, 388
multiple curvilinear regression, 408
of differences between treatments, 397, 402
testing additional terms in simple regression, 403
testing curvilinearity of regression, 405
testing significance of two or more principles of classification, 406
Vitreous wheat kernels, as example, 82
Volumes of objects, studies of, 440
von Neumann's ratio, 336, 339
significance of, 341

Water flow, example of simple correlation, 57
studies of, 439
Weather conditions and crop yields, studies of, 436
Wheat, vitreous kernels as example, 81

Yield of corn, as example of multiple curvilinear correlation, 2l\
other factors affecting, 243



